uu.se: Publications from Uppsala University
A joint use of pooling and imputation for genotyping SNPs
Clouard, Camille: Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
Ausmees, Kristiina: Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science. ORCID iD: 0000-0002-6212-539X
Nettelblad, Carl: Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science. ORCID iD: 0000-0003-0458-6902
2022 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 23, article id 421. Article in journal (Refereed). Published
Abstract [en]

Background

Despite continuing technological advances, the cost of large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest using pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of greatest importance for non-model organisms, for which cost-efficient large-scale chips and high-quality reference genomes are more limited, for example in wildlife monitoring and in plant and animal breeding, but the approach is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance on pooled data of the Beagle algorithm and of a local likelihood-aware phasing algorithm, closely modeled on MaCH, that we implemented.
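The pooling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a hypothetical row-and-column design in which 16 samples sit on a 4 x 4 grid and are tested as 8 pools, with genotypes coded as 0, 1, 2 (count of the alternate allele) and a pool reading 0 or 2 only when all of its members agree. The names `GRID`, `pool_genotype`, `encode` and `decode` are illustrative only, and the decoding step is plain logical deduction, not the statistical estimation the abstract refers to.

```python
from typing import List, Optional, Tuple

GRID = 4  # hypothetical block size: 16 samples on a 4 x 4 grid, 8 pools


def pool_genotype(genotypes: List[int]) -> int:
    """Genotype read for a pool: 0 or 2 only if every member agrees, else 1."""
    if all(g == 0 for g in genotypes):
        return 0
    if all(g == 2 for g in genotypes):
        return 2
    return 1  # both alleles are present somewhere in the pool


def encode(block: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Pool every row and every column of the block at one marker."""
    rows = [pool_genotype(row) for row in block]
    cols = [pool_genotype([block[r][c] for r in range(GRID)]) for c in range(GRID)]
    return rows, cols


def decode(rows: List[int], cols: List[int]) -> List[List[Optional[int]]]:
    """Recover what can be deduced with certainty; None marks ambiguity."""
    decoded: List[List[Optional[int]]] = []
    for r in range(GRID):
        line: List[Optional[int]] = []
        for c in range(GRID):
            pair = (rows[r], cols[c])
            if 0 in pair:        # a homozygous-reference pool fixes its members
                line.append(0)
            elif 2 in pair:      # likewise for a homozygous-alternate pool
                line.append(2)
            else:                # both pools are mixed: genotype stays unknown
                line.append(None)
        decoded.append(line)
    return decoded


# One heterozygous carrier among 16 samples at this marker.
block = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rows, cols = encode(block)
decoded = decode(rows, cols)
```

In this toy case 15 of the 16 genotypes are recovered from 8 pooled tests, and the single entry left as `None` is the kind of undecoded genotype that a statistical estimation step and subsequent imputation would have to resolve.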

Results

We conduct simulations based on human data from the 1000 Genomes Project to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, outperforming the Beagle imputation model, which reaches 80% concordance. We observe that the pooling design gives higher concordance for rare variants than the traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.
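As a rough illustration of the concordance figures quoted above, the sketch below computes concordance as the fraction of genotype calls that exactly match the true genotypes. This is a common definition and an assumption here; the exact metric used in the article may differ, and the function name `concordance` is illustrative.

```python
from typing import Sequence


def concordance(true_gt: Sequence[int], predicted_gt: Sequence[int]) -> float:
    """Fraction of positions where the predicted genotype equals the true one."""
    if len(true_gt) != len(predicted_gt):
        raise ValueError("genotype sequences must have the same length")
    if len(true_gt) == 0:
        raise ValueError("at least one genotype is required")
    matches = sum(t == p for t, p in zip(true_gt, predicted_gt))
    return matches / len(true_gt)


# Four genotype calls, three of them correct.
score = concordance([0, 1, 2, 0], [0, 1, 1, 0])
```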

Conclusions

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.

Place, publisher, year, edition, pages
Springer Nature, 2022. Vol. 23, article id 421
Keywords [en]
Pooling, Imputation, Genotyping
National Category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
URN: urn:nbn:se:uu:diva-486864
DOI: 10.1186/s12859-022-04974-7
ISI: 000867656900001
PubMedID: 36229780
OAI: oai:DiVA.org:uu-486864
DiVA, id: diva2:1704586
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Swedish National Infrastructure for Computing (SNIC), 2019/8-216
Swedish National Infrastructure for Computing (SNIC), 2020/5-455
Uppsala University
Available from: 2022-10-18 Created: 2022-10-18 Last updated: 2024-01-17 Bibliographically approved
In thesis
1. Computational statistical methods for genotyping biallelic DNA markers from pooled experiments
2022 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The information conveyed by genetic markers such as Single Nucleotide Polymorphisms (SNPs) has been widely used in biomedical research for studying human diseases, but also increasingly in agriculture by plant and animal breeders for selection purposes. Specific identified markers can act as a genetic signature that is correlated to certain characteristics in a living organism, e.g. a sensitivity to a disease or high-yield traits. Capturing these signatures with sufficient statistical power often requires large volumes of data, with thousands of samples to analyze and possibly millions of genetic markers to screen. Establishing statistical significance for effects from genetic variations is especially delicate when they occur at low frequencies.

The production cost of such marker genotype data is therefore a critical part of the analysis. Despite recent technological advances, the production cost can still be prohibitive, and genotype imputation strategies have been developed to address this issue. Genotype imputation methods have been widely investigated on human data and, to a smaller extent, on crop and animal species. Where only a few reference genomes are available for imputation purposes, as for non-model organisms, the imputation results can be less accurate. Group testing strategies, also called pooling strategies, can be well suited to complementing imputation in large populations and to decreasing the number of genotyping tests required compared to testing every individual separately. Pooling is especially efficient for genotyping low-frequency variants. However, because of the particular nature of genotype data and the limitations inherent to genotype testing techniques, decoding pooled genotypes into unique data resolutions is a challenge. Overall, the decoding problem with pooled genotypes can be described as an inference problem in Missing Not At Random data with nonmonotone missingness patterns.
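The efficiency of pooling for low-frequency variants can be made concrete with a back-of-the-envelope sketch. It assumes diploid, unrelated samples in Hardy-Weinberg equilibrium, so that a pool of n samples carries 2n independent allele draws; these assumptions and the name `prob_pool_all_major` are illustrative and are not taken from the thesis. Under them, the probability that a pool contains no copy of the minor allele, in which case one test identifies every member as homozygous for the major allele, is (1 - maf)^(2n).

```python
def prob_pool_all_major(maf: float, pool_size: int) -> float:
    """P(no minor allele in a pool of `pool_size` diploid samples) under HWE."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return (1.0 - maf) ** (2 * pool_size)


# Hypothetical pools of 4 samples: a rare variant versus a common one.
rare = prob_pool_all_major(0.01, 4)
common = prob_pool_all_major(0.20, 4)
```

Under these assumptions `rare` is much larger than `common`: a pool is far more likely to be fully resolved by a single test when the variant is rare.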

Specific inference methods, such as variations of the Expectation-Maximization algorithm, can be used to resolve the pooled data into estimates of the genotype probabilities for every individual. However, the non-randomness of the undecoded data impacts the outcomes of the inference process. This impact propagates to imputation if the inferred genotype probabilities are used as input to classical genotype imputation methods. In this work, we study the specific characteristics of a pooling scheme on genotype data, as well as how it affects the results of imputation methods such as tree-based haplotype clustering or coalescent models.

Place, publisher, year, edition, pages
Uppsala: Uppsala University, 2022. p. 57
Series
Information technology licentiate theses: Licentiate theses from the Department of Information Technology, ISSN 1404-5117 ; 2022-003
National Category
Other Computer and Information Science
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:uu:diva-486868 (URN)
Presentation
2022-11-09, 101130, Ångström, 13:15 (English)
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2022-10-31 Created: 2022-10-19 Last updated: 2022-10-31 Bibliographically approved
2. A computational and statistical framework for cost-effective genotyping combining pooling and imputation
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The information conveyed by genetic markers, such as single nucleotide polymorphisms (SNPs), has been widely used in biomedical research to study human diseases and is increasingly valued in agriculture for genomic selection purposes. Specific markers can be identified as a genetic signature that correlates with certain characteristics in a living organism, e.g. a susceptibility to disease or high-yield traits. Capturing these signatures with sufficient statistical power often requires large volumes of data, with thousands of samples to be analysed and potentially millions of genetic markers to be screened. Relevant effects are particularly delicate to detect when the genetic variations involved occur at low frequencies.

The cost of producing such marker genotype data is therefore a critical part of the analysis. Despite recent technological advances, production costs can still be prohibitive on a large scale and genotype imputation strategies have been developed to address this issue. Genotype imputation methods have been extensively studied in human data and, to a lesser extent, in crop and animal species. A recognised weakness of imputation methods is their lower accuracy in predicting the genotypes for rare variants, whereas those can be highly informative in association studies and improve the accuracy of genomic selection. In this respect, pooling strategies can be well suited to complement imputation, as pooling is efficient at capturing the low-frequency items in a population. Pooling also reduces the number of genotyping tests required, making its use in combination with imputation a cost-effective compromise between accurate but expensive high-density genotyping of each sample individually and stand-alone imputation. However, due to the nature of genotype data and the limitations of genotype testing techniques, decoding pooled genotypes into unique data resolutions is challenging. 

In this work, we study the characteristics of genotype data decoded from pooled observations with a specific pooling scheme, using the examples of a human cohort and a population of inbred wheat lines. We propose different inference strategies to reconstruct the genotypes before using them as input to imputation, and we reflect on how the reconstructed distributions affect the results of imputation methods such as tree-based haplotype clustering or coalescent models.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2024. p. 81
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2354
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:uu:diva-519887 (URN)
978-91-513-2006-9 (ISBN)
Public defence
2024-03-08, 101195 (Heinz-Otto Kreiss), Ångströmlaboratoriet, Lägerhyddsvägen 1, hus 10, Uppsala, 10:15 (English)
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2024-02-08 Created: 2024-01-10 Last updated: 2024-02-08

Open Access in DiVA

fulltext (2778 kB), 268 downloads
File information
File name: FULLTEXT01.pdf
File size: 2778 kB
Checksum (SHA-512): 81a2bf43de00994aefbf653a4438f07eef0071540c79321c08bc1535f67e511b326ca063129d3dd45245a8171e35cef9648b4986f8a6974c2a5e7359d14aca60
Type: fulltext
Mimetype: application/pdf

Authority records

Clouard, Camille; Ausmees, Kristiina; Nettelblad, Carl
