Publications from Uppsala University
Publications (10 of 10)
Ausmees, K. & Nettelblad, C. (2023). Achieving improved accuracy for imputation of ancient DNA. Bioinformatics, 39(1), Article ID btac738.
2023 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, no. 1, article id btac738. Article in journal (Refereed). Published
Abstract [en]

Motivation

Genotype imputation has the potential to increase the amount of information that can be gained from the often limited biological material available in ancient samples. As many widely used tools have been developed with modern data in mind, their design is not necessarily reflective of the requirements in studies of ancient DNA. Here, we investigate if an imputation method based on the full probabilistic Li and Stephens model of haplotype frequencies might be beneficial for the particular challenges posed by ancient data.

Results

We present an implementation called prophaser and compare imputation performance to two alternative pipelines that have been used in the ancient DNA community based on the Beagle software. Considering empirical ancient data downsampled to lower coverages as well as present-day samples with artificially thinned genotypes, we show that the proposed method is advantageous at lower coverages, where it yields improved accuracy and ability to capture rare variation. The software prophaser is optimized for running in a massively parallel manner and achieved reasonable runtimes on the experiments performed when executed on a GPU.

Place, publisher, year, edition, pages
Oxford University Press, 2023
Keywords
Bioinformatics, Computational biology, Genotype imputation, Ancient DNA
Research subject
Bioinformatics; Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470292 (URN), 10.1093/bioinformatics/btac738 (DOI), 000892594400001 (), 36377787 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. & Nettelblad, C. (2022). A deep learning framework for characterization of genotype data. G3: Genes, Genomes, Genetics, 12(3), Article ID jkac020.
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no. 3, article id jkac020. Article in journal (Refereed). Published
Abstract [en]

Dimensionality reduction is a data transformation technique widely used in various fields of genomics research. The application of dimensionality reduction to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. Among frequently used methods are principal component analysis, which is a linear transform that often misses more fine-scale structures, and neighbor-graph based methods which focus on local relationships rather than large-scale patterns. Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this study, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, while preserving global geometry to a higher extent than t-SNE and UMAP, yielding results that are comparable to an alternative deep learning approach based on variational autoencoders. We also discuss the use of the methodology for more general characterization of genotype data, showing that it preserves spatial properties in the form of decay of linkage disequilibrium with distance along the genome and demonstrating its use as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2022
Keywords
deep learning, convolutional autoencoder, dimensionality reduction, genetic clustering, population genetics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470290 (URN), 10.1093/g3journal/jkac020 (DOI), 000776673300018 (), 35078229 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453; Swedish Research Council Formas, 2020-00712
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved
Clouard, C., Ausmees, K. & Nettelblad, C. (2022). A joint use of pooling and imputation for genotyping SNPs. BMC Bioinformatics, 23, Article ID 421.
2022 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 23, article id 421. Article in journal (Refereed). Published
Abstract [en]

Background

Despite continuing technological advances, the cost for large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to drop the amount of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, such as application in wildlife monitoring, plant and animal breeding, but it is in essence species-agnostic. The proposed approach consists in grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and a local likelihood-aware phasing algorithm closely modeled on MaCH that we implemented.

Results

We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thus it outperforms the Beagle imputation model which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.

Conclusions

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Pooling, Imputation, Genotyping
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-486864 (URN), 10.1186/s12859-022-04974-7 (DOI), 000867656900001 (), 36229780 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453; Swedish National Infrastructure for Computing (SNIC), 2019/8-216; Swedish National Infrastructure for Computing (SNIC), 2020/5-455; Uppsala University
Available from: 2022-10-18. Created: 2022-10-18. Last updated: 2024-01-17. Bibliographically approved
Ausmees, K., Sanchez-Quinto, F., Jakobsson, M. & Nettelblad, C. (2022). An empirical evaluation of genotype imputation of ancient DNA. G3: Genes, Genomes, Genetics, 12(6), Article ID jkac089.
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no. 6, article id jkac089. Article in journal (Refereed). Published
Abstract [en]

With capabilities of sequencing ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes presents a possibility to increase the power of inference as well as cost-effectiveness for the analysis of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle v4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference, and study sample size. Making use of five ancient individuals with high-coverage data available, we evaluated imputed data for accuracy, reference bias, and genetic affinities as captured by principal component analysis. We obtained genotype concordance levels of over 99% for data with 1× coverage, and similar levels of accuracy and reference bias at levels as low as 0.75×. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1×. We also show that a large and varied phased reference panel as well as the inclusion of low- to moderate-coverage ancient individuals in the study sample can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for postprocessing and validation prior to downstream analysis.

Place, publisher, year, edition, pages
Oxford University Press, 2022
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-396336 (URN), 10.1093/g3journal/jkac089 (DOI), 000791204600001 (), 35482488 (PubMedID)
Projects
eSSENCE
Funder
Swedish Research Council Formas, 2020-00712
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. (2022). Methodology and Infrastructure for Statistical Computing in Genomics: Applications for Ancient DNA. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis concerns the development and evaluation of computational methods for analysis of genetic data. A particular focus is on ancient DNA recovered from archaeological finds, the analysis of which has contributed to novel insights into human evolutionary and demographic history, while also introducing new challenges and the demand for specialized methods.

A main topic is that of imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also develop a tool for genotype imputation that is based on the full probabilistic Li and Stephens model for haplotype frequencies and show that it can yield improved accuracy on particularly challenging data.  

Another central subject in genomics and population genetics is that of data characterization methods that allow for visualization and exploratory analysis of complex information. We discuss challenges associated with performing dimensionality reduction of genetic data, demonstrating how the use of principal component analysis is sensitive to incomplete information and performing an evaluation of methods to handle unobserved genotypes. We also discuss the use of deep learning models as an alternative to traditional methods of data characterization in genomics and propose a framework based on convolutional autoencoders that we exemplify on the applications of dimensionality reduction and genetic clustering.

In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The final part of this thesis addresses the use of cloud resources for facilitating data analysis in scientific applications. We present two different cloud-based solutions, and exemplify them on applications from genomics.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2022. p. 53
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2129
Keywords
statistical computing, genotype imputation, ancient DNA, deep learning, dimensionality reduction, genetic clustering, distributed computing
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470703 (URN), 978-91-513-1457-0 (ISBN)
Public defence
2022-06-08, 101121, Lägerhyddsvägen 1, Uppsala, 10:15 (English)
Projects
eSSENCE
Available from: 2022-05-17. Created: 2022-03-28. Last updated: 2025-02-01
John, A., Muenzen, K. & Ausmees, K. (2021). Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLOS ONE, 16(7), Article ID e0254363.
2021 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 16, no. 7, article id e0254363. Article in journal (Refereed). Published
Abstract [en]

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.

Place, publisher, year, edition, pages
Public Library of Science (PLoS), 2021
Identifiers
urn:nbn:se:uu:diva-452320 (URN), 10.1371/journal.pone.0254363 (DOI), 000674301400079 (), 34242357 (PubMedID)
Available from: 2021-09-06. Created: 2021-09-06. Last updated: 2024-01-15. Bibliographically approved
Ausmees, K. (2019). Efficient computational methods for applications in genomics. (Licentiate dissertation). Uppsala University
2019 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

During the last two decades, advances in molecular technology have facilitated the sequencing and analysis of ancient DNA recovered from archaeological finds, contributing to novel insights into human evolutionary history. As more ancient genetic information has become available, the need for specialized methods of analysis has also increased. In this thesis, we investigate statistical and computational models for analysis of genetic data, with a particular focus on the context of ancient DNA.

The main focus is on imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also discuss preliminary results from a simulation study comparing two methods of phasing and imputation, which suggest that the parametric Li and Stephens framework may be more robust to extremely low levels of sparsity than the parsimonious Browning and Browning model.

An evaluation of methods to handle missing data in the application of PCA for dimensionality reduction of genotype data is also presented. We illustrate that non-overlapping sequence data can lead to artifacts in projected scores, and evaluate different methods for handling unobserved genotypes.

In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The last part of this thesis addresses the use of cloud resources for facilitating such analysis. We present two different cloud-based solutions, and exemplify them on applications from genomics.

Place, publisher, year, edition, pages
Uppsala University, 2019
Series
IT licentiate theses / Uppsala University, Department of Information Technology, ISSN 1404-5117 ; 2019-006
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-396409 (URN)
Projects
eSSENCE
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. (2019). Evaluation of methods handling missing data in PCA on genotype data: Applications for ancient DNA.
2019 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2019-009
Identifiers
urn:nbn:se:uu:diva-396346 (URN)
Projects
eSSENCE
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
John, A., Ausmees, K., Muenzen, K., Kuhn, C. & Tan, A. (2019). SWEEP: Accelerating scientific research through scalable serverless workflows. In: Companion Proc. 12th International Conference on Utility and Cloud Computing. Paper presented at UCC 2019 (pp. 43-50). New York: ACM Press
2019 (English). In: Companion Proc. 12th International Conference on Utility and Cloud Computing, New York: ACM Press, 2019, p. 43-50. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2019
Identifiers
urn:nbn:se:uu:diva-396405 (URN), 10.1145/3368235.3368839 (DOI), 978-1-4503-7044-8 (ISBN)
Conference
UCC 2019
Projects
eSSENCE
Available from: 2019-12-02. Created: 2019-11-04. Last updated: 2022-03-28. Bibliographically approved
Ausmees, K., John, A., Toor, S. Z., Hellander, A. & Nettelblad, C. (2018). BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data. BMC Bioinformatics, 19, 240:1-11, Article ID 240.
2018 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 19, p. 240:1-11, article id 240. Article in journal (Refereed). Published
Identifiers
urn:nbn:se:uu:diva-360033 (URN), 10.1186/s12859-018-2241-z (DOI), 000436517200001 (), 29940842 (PubMedID)
Projects
eSSENCE
Available from: 2018-06-26. Created: 2018-09-09. Last updated: 2025-02-01. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-6212-539x