A computational and statistical framework for cost-effective genotyping combining pooling and imputation
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. ORCID iD: 0009-0006-3654-6525
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Description
Abstract [en]

The information conveyed by genetic markers, such as single nucleotide polymorphisms (SNPs), has been widely used in biomedical research to study human diseases and is increasingly valued in agriculture for genomic selection purposes. Specific markers can be identified as a genetic signature that correlates with certain characteristics in a living organism, e.g. a susceptibility to disease or high-yield traits. Capturing these signatures with sufficient statistical power often requires large volumes of data, with thousands of samples to be analysed and potentially millions of genetic markers to be screened. Relevant effects are particularly delicate to detect when the genetic variations involved occur at low frequencies.

The cost of producing such marker genotype data is therefore a critical part of the analysis. Despite recent technological advances, production costs can still be prohibitive on a large scale, and genotype imputation strategies have been developed to address this issue. Genotype imputation methods have been extensively studied in human data and, to a lesser extent, in crop and animal species. A recognised weakness of imputation methods is their lower accuracy in predicting the genotypes of rare variants, even though these variants can be highly informative in association studies and can improve the accuracy of genomic selection. In this respect, pooling strategies can be well suited to complement imputation, as pooling is efficient at capturing the low-frequency items in a population. Pooling also reduces the number of genotyping tests required, making its use in combination with imputation a cost-effective compromise between accurate but expensive high-density genotyping of each sample individually and stand-alone imputation. However, due to the nature of genotype data and the limitations of genotype testing techniques, decoding pooled genotypes into unambiguous individual genotypes is challenging.

In this work, we study the characteristics of genotype data decoded from pooled observations with a specific pooling scheme, using the examples of a human cohort and a population of inbred wheat lines. We propose different inference strategies to reconstruct the genotypes before supplying them as input to imputation, and we reflect on how the reconstructed distributions affect the results of imputation methods such as tree-based haplotype clustering or coalescent models.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2024. p. 81
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2354
National subject category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:uu:diva-519887; ISBN: 978-91-513-2006-9 (print); OAI: oai:DiVA.org:uu-519887; DiVA, id: diva2:1825852
Public defence
2024-03-08, 101195 (Heinz-Otto Kreiss), Ångströmlaboratoriet, Lägerhyddsvägen 1, building 10, Uppsala, 10:15 (English)
Opponent
Supervisors
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2024-02-08 Created: 2024-01-10 Last updated: 2024-02-08
List of papers
1. A joint use of pooling and imputation for genotyping SNPs
2022 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 23, article id 421. Article in journal (Refereed). Published
Abstract [en]

Background

Despite continuing technological advances, the cost of large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest using pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, for example in wildlife monitoring and in plant and animal breeding, but the approach is in essence species-agnostic. It consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is smaller than the number of individual samples. We present a statistical estimation algorithm that uses the pooling outcomes to infer, marker by marker, the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data between the Beagle algorithm and a local likelihood-aware phasing algorithm, closely modeled on MaCH, that we implemented.

Results

We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, before any imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, outperforming the Beagle imputation model, which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than the traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.

Conclusions

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.
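
The pooling and decoding steps described in this abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: a square 4 x 4 block in which every row and every column forms one pool (8 pools for 16 samples), genotypes coded as alternate-allele counts 0, 1, 2, and a pool that reads as heterozygous whenever both alleles are present. The function names `pool_block`, `decode_block` and `concordance` are hypothetical, and the decoding shown is only the unambiguous part, not the statistical estimation algorithm of the paper.

```python
import numpy as np


def pool_block(genotypes):
    """Simulate row and column pools for one square block at one marker.

    genotypes: (n, n) integer array with entries in {0, 1, 2}.
    A pool reads 0 or 2 only if all its samples are homozygous for the
    same allele; otherwise both alleles are detected and it reads 1.
    """
    g = np.asarray(genotypes)

    def read(pool):
        if np.all(pool == 0):
            return 0
        if np.all(pool == 2):
            return 2
        return 1

    rows = np.array([read(g[i, :]) for i in range(g.shape[0])])
    cols = np.array([read(g[:, j]) for j in range(g.shape[1])])
    return rows, cols


def decode_block(rows, cols):
    """Recover the genotypes that the pools determine; -1 marks unresolved."""
    n = len(rows)
    decoded = np.full((n, n), -1, dtype=int)
    for i in range(n):
        for j in range(n):
            # A homozygous pool fixes the genotype of all its members.
            for pool in (rows[i], cols[j]):
                if pool in (0, 2):
                    decoded[i, j] = pool
    return decoded


def concordance(truth, predicted):
    """Fraction of resolved genotypes that match the true genotypes."""
    mask = predicted >= 0
    return float(np.mean(truth[mask] == predicted[mask]))


# One block with a single heterozygous carrier of the rare allele.
g = np.zeros((4, 4), dtype=int)
g[1, 2] = 1
rows, cols = pool_block(g)
decoded = decode_block(rows, cols)
print(decoded)
print(concordance(g, decoded))
```

In this example only the carrier at position (1, 2) stays unresolved, since both of its pools read as heterozygous. Such unresolved entries are the ones that would be handed over to the estimation and imputation steps.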

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Pooling, Imputation, Genotyping
National subject category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-486864 (URN); 10.1186/s12859-022-04974-7 (DOI); 000867656900001 (); 36229780 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453; Swedish National Infrastructure for Computing (SNIC), 2019/8-216; Swedish National Infrastructure for Computing (SNIC), 2020/5-455; Uppsala University
Available from: 2022-10-18 Created: 2022-10-18 Last updated: 2024-01-17 Bibliographically approved
2. Consistency Study of a Reconstructed Genotype Probability Distribution via Clustered Bootstrapping in NORB Pooling Blocks
2022 (English). Report (Other academic)
Abstract [en]

For applications with biallelic genetic markers, group testing techniques, also known as pooling techniques, are usually applied to decrease the cost of large-scale testing, e.g. when detecting carriers of rare genetic variants. In some configurations, the results of the grouped tests cannot be decoded and the pooled items are missing. These missing items can be inferred with specific statistical methods that are, for example, related to the Expectation-Maximization algorithm. Pooling has also been applied to determine the genotype of markers in large populations. What distinguishes full genotype data for diploid organisms in the context of group testing is the ternary outcomes (two homozygous genotypes and one heterozygous), as well as the distribution of these three outcomes in a population, which is often governed by the Hardy-Weinberg Equilibrium and in that case depends on the allele frequency. When using a nonoverlapping repeated block pooling design, the missing items are only observed in particular arrangements. Overall, a data set of pooled genotypes can be described as an inference problem in Missing Not At Random data with nonmonotone missingness patterns. This study presents a preliminary investigation of the consistency of various iterative methods that estimate the most likely genotype probabilities of the missing items in pooled data. As a measure of distributional consistency, we use the Kullback-Leibler divergence and the L2 distance between the genotype distribution computed from our estimates and a simulated empirical distribution.
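
The quantities named in this abstract can be written down in a few lines. The sketch below computes the Hardy-Weinberg genotype distribution for a given alternate-allele frequency and the two distance measures. The function names, the direction of the Kullback-Leibler divergence, and the clipping constant `eps` are assumptions made for illustration; the report may define them differently.

```python
import numpy as np


def hwe_genotype_probs(alt_freq):
    """Genotype probabilities (hom-ref, het, hom-alt) under Hardy-Weinberg."""
    q = float(alt_freq)
    return np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def l2_distance(p, q):
    """Euclidean distance between two genotype distributions."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.linalg.norm(diff))


# Expected distribution for a rare variant (alternate-allele frequency 0.1)
expected = hwe_genotype_probs(0.1)
# Hypothetical distribution computed from estimated genotypes
estimated = np.array([0.80, 0.19, 0.01])

print(expected)
print(kl_divergence(estimated, expected))
print(l2_distance(estimated, expected))
```

Both measures are zero when the two distributions coincide, so smaller values indicate an estimate that is more consistent with the reference distribution.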

Place, publisher, year, edition, pages
Uppsala: Uppsala University, 2022. p. 13
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2022-005
National subject category
Probability Theory and Statistics
Identifiers
urn:nbn:se:uu:diva-487718 (URN)
Available from: 2022-10-31 Created: 2022-10-31 Last updated: 2024-01-10 Bibliographically approved
3. Genotyping of SNPs in bread wheat at reduced cost from pooled experiments and imputation
2024 (English). In: Theoretical and Applied Genetics, ISSN 0040-5752, E-ISSN 1432-2242, Vol. 137, no 1, article id 26. Article in journal (Refereed). Published
Abstract [en]

The plant breeding industry has shown growing interest in using the genotype data of relevant markers to select new competitive varieties. The selection usually benefits from large amounts of marker data, and it is therefore crucial to have data collection methods that are both cost-effective and reliable. Computational methods such as genotype imputation have been proposed in several earlier plant science studies to address the cost challenge. Genotype imputation methods have, however, been used more frequently and investigated more extensively in human genetics research. The various existing algorithms have shown lower accuracy at inferring the genotype of genetic variants occurring at low frequency, although these rare variants can have great significance and impact in the genetic studies that underlie selection. In contrast, pooling is a technique that can efficiently identify low-frequency items in a population, and it has been successfully used to detect the samples that carry rare variants. In this study, we propose to combine pooling and imputation, and demonstrate this by simulating a hypothetical microarray for genotyping a population of recombinant inbred lines in a cost-effective and accurate manner, even for rare variants. We show that with an adequate imputation model, it is feasible to predict the individual genotypes accurately, in a time-effective manner, and at lower cost than sample-wise genotyping. Moreover, we provide code resources for reproducing the results presented in this study in the form of a containerized workflow.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
genotyping, imputation, MAGIC population, pooling, wheat
National subject category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-518436 (URN); 10.1007/s00122-023-04533-5 (DOI); 001145311600001 (); 38243086 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2024-01-10 Created: 2024-01-10 Last updated: 2025-01-07 Bibliographically approved
4. Using feedback in pooled experiments augmented with imputation for high genotyping accuracy at reduced cost
2025 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, article id jkaf010. Article in journal (Refereed). Epub ahead of print
Abstract [en]

Conducting genomic selection in plant breeding programs can substantially speed up the development of new varieties. Genomic selection provides more reliable insights when it is based on dense marker data, in which the rare variants can be particularly informative. Despite the availability of new technologies, the cost of large-scale genotyping remains a major limitation to the implementation of genomic selection. We suggest combining pooled genotyping with population-based imputation as a cost-effective computational strategy for genotyping SNPs. Pooling saves genotyping tests and has been shown to capture accurately the rare variants that are usually missed by imputation. In this study, we investigate adding iterative coupling to a joint model of pooling and imputation that we have previously proposed. In each iteration, the imputed genotype probabilities serve as feedback input for adjusting the per-sample prior genotype probabilities, before a new imputation is run on these adjusted data. This flexible setup indirectly imposes consistency between the imputed genotypes and the pooled observations. We demonstrate that repeated cycles of feedback can take advantage of the strengths of both pooling and imputation when an appropriate set of reference haplotypes is available for imputation. The iterations improve greatly upon the initial genotype predictions, achieving very high genotype accuracy for both low- and high-frequency variants. We raise the average concordance from 94.5% to 98.4% at limited computational cost and without requiring any additional genotype testing.
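
The feedback cycle described in this abstract can be sketched as a short loop. Everything specific in the sketch is an assumption for illustration: the update rule (multiplying the pooling-derived priors by the imputed probabilities and renormalising), the number of cycles, and the function `toy_impute`, which is a hypothetical stand-in for a real imputation program and simply blends each sample with the cohort average. It is not the model used in the paper.

```python
import numpy as np


def normalize(probs):
    """Scale each row of genotype probabilities so that it sums to 1."""
    probs = np.asarray(probs, dtype=float)
    return probs / probs.sum(axis=1, keepdims=True)


def toy_impute(priors):
    """Hypothetical stand-in for an imputation tool.

    Blends each sample's prior with the cohort-average distribution.
    """
    cohort = priors.mean(axis=0, keepdims=True)
    return normalize(0.5 * priors + 0.5 * cohort)


def feedback_cycles(pool_priors, impute, n_cycles=3):
    """Alternate imputation and prior adjustment for one marker.

    pool_priors: (n_samples, 3) genotype probabilities derived from the
    pooled observations; a zero marks a genotype excluded by the pools.
    """
    pool_priors = np.asarray(pool_priors, dtype=float)
    priors = normalize(pool_priors)
    for _ in range(n_cycles):
        posteriors = impute(priors)
        # Assumed update: genotypes excluded by pooling keep probability 0,
        # which keeps the result consistent with the pooled observations.
        priors = normalize(pool_priors * posteriors)
    return priors


# Sample 1 is ambiguous after pooling: heterozygous or homozygous alternate.
pool_priors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])
result = feedback_cycles(pool_priors, toy_impute)
print(result)
```

In this toy run the ambiguous sample shifts towards the heterozygous genotype, which is more common in the cohort, while the genotype ruled out by the pools stays at probability zero. With a real imputation tool, a guard against rows whose product is entirely zero would also be needed.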

Place, publisher, year, edition, pages
Oxford University Press, 2025
Keywords
SNP array, pooling, imputation, iterative refinement
National subject category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-518429 (URN); 10.1093/g3journal/jkaf010 (DOI)
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2023-12-19 Created: 2023-12-19 Last updated: 2025-01-24

Open Access in DiVA

UUThesis_C-Clouard-2024 (1029 kB)
File information
File name: FULLTEXT01.pdf. File size: 1029 kB. Checksum: SHA-512
a136b90991c3c2e6f9384cc7a2b44e715803d1d65e02dd6a96cad89309b14ace638cbbbd8b1a1757b3bf28616e87a8b70e9669995bf62135bc9f4291bfec1066
Type: fulltext. Mimetype: application/pdf

Person

Clouard, Camille
