Publications from Uppsala University (uu.se)
Publications (10 of 10)
Ausmees, K. & Nettelblad, C. (2023). Achieving improved accuracy for imputation of ancient DNA. Bioinformatics, 39(1), Article ID btac738.
2023 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, no. 1, article ID btac738. Journal article (Refereed). Published
Abstract [en]

Motivation

Genotype imputation has the potential to increase the amount of information that can be gained from the often limited biological material available in ancient samples. As many widely used tools have been developed with modern data in mind, their design is not necessarily reflective of the requirements in studies of ancient DNA. Here, we investigate if an imputation method based on the full probabilistic Li and Stephens model of haplotype frequencies might be beneficial for the particular challenges posed by ancient data.

Results

We present an implementation called prophaser and compare imputation performance to two alternative pipelines that have been used in the ancient DNA community based on the Beagle software. Considering empirical ancient data downsampled to lower coverages as well as present-day samples with artificially thinned genotypes, we show that the proposed method is advantageous at lower coverages, where it yields improved accuracy and ability to capture rare variation. The software prophaser is optimized for running in a massively parallel manner and achieved reasonable runtimes on the experiments performed when executed on a GPU.
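
To make the modelling idea concrete, the following is a minimal sketch of imputation under a Li and Stephens style copying model. It is not the prophaser implementation: it assumes a haploid target, a constant switch probability between sites and a fixed mismatch rate, and the function name and parameters (`li_stephens_posteriors`, `switch`, `err`) are hypothetical. The target is treated as an imperfect mosaic of the reference haplotypes, and a forward-backward pass gives the posterior probability of each allele at unobserved sites.

```python
import numpy as np

def li_stephens_posteriors(ref_haps, obs, switch=0.05, err=0.01):
    """Posterior probability that the copied haplotype carries allele 1.

    ref_haps: (K, M) array of 0/1 alleles for K reference haplotypes.
    obs:      length-M array with 0/1 for observed sites and -1 for missing.
    switch, err: assumed constant switch and mismatch probabilities.
    """
    ref_haps = np.asarray(ref_haps)
    obs = np.asarray(obs)
    K, M = ref_haps.shape

    def emission(m):
        # Missing sites carry no information about the copied haplotype.
        if obs[m] < 0:
            return np.ones(K)
        return np.where(ref_haps[:, m] == obs[m], 1.0 - err, err)

    # Forward pass, normalised at every site for numerical stability.
    fwd = np.zeros((M, K))
    fwd[0] = emission(0) / K
    fwd[0] /= fwd[0].sum()
    for m in range(1, M):
        prior = (1.0 - switch) * fwd[m - 1] + switch * fwd[m - 1].sum() / K
        fwd[m] = prior * emission(m)
        fwd[m] /= fwd[m].sum()

    # Backward pass with the same transition structure.
    bwd = np.ones((M, K))
    for m in range(M - 2, -1, -1):
        msg = bwd[m + 1] * emission(m + 1)
        bwd[m] = (1.0 - switch) * msg + switch * msg.sum() / K
        bwd[m] /= bwd[m].sum()

    # Posterior over which reference haplotype is copied at each site.
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)
    # Probability mass on reference haplotypes carrying allele 1.
    return (post * ref_haps.T).sum(axis=1)

ref = np.array([[0, 0, 0, 0, 0],
                [1, 1, 1, 1, 1]])
target = np.array([1, 1, -1, 1, 1])  # the middle site is unobserved
print(li_stephens_posteriors(ref, target))
```

In this toy example the flanking observations all match the second reference haplotype, so the missing middle site receives a high posterior probability of allele 1. A diploid model working on genotype likelihoods, as is typical for low-coverage data, would need a state space over pairs of haplotypes.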

Place, publisher, year, edition, pages
Oxford University Press, 2023
Keywords
Bioinformatics, Computational biology, Genotype imputation, Ancient DNA
National subject category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics
Research subject
Bioinformatics; Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470292 (URN); 10.1093/bioinformatics/btac738 (DOI); 000892594400001 (); 36377787 (PubMedID)
Project
eSSENCE - An eScience Collaboration
Funder
Forskningsrådet Formas, 2017-00453
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. & Nettelblad, C. (2022). A deep learning framework for characterization of genotype data. G3: Genes, Genomes, Genetics, 12(3), Article ID jkac020.
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no. 3, article ID jkac020. Journal article (Refereed). Published
Abstract [en]

Dimensionality reduction is a data transformation technique widely used in various fields of genomics research. The application of dimensionality reduction to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. Among frequently used methods are principal component analysis, which is a linear transform that often misses more fine-scale structures, and neighbor-graph based methods which focus on local relationships rather than large-scale patterns. Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this study, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, while preserving global geometry to a higher extent than t-SNE and UMAP, yielding results that are comparable to an alternative deep learning approach based on variational autoencoders. We also discuss the use of the methodology for more general characterization of genotype data, showing that it preserves spatial properties in the form of decay of linkage disequilibrium with distance along the genome and demonstrating its use as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2022
Keywords
deep learning, convolutional autoencoder, dimensionality reduction, genetic clustering, population genetics
National subject category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470290 (URN); 10.1093/g3journal/jkac020 (DOI); 000776673300018 (); 35078229 (PubMedID)
Project
eSSENCE - An eScience Collaboration
Funder
Forskningsrådet Formas, 2017-00453; Forskningsrådet Formas, 2020-00712
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved
Clouard, C., Ausmees, K. & Nettelblad, C. (2022). A joint use of pooling and imputation for genotyping SNPs. BMC Bioinformatics, 23, Article ID 421.
2022 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 23, article ID 421. Journal article (Refereed). Published
Abstract [en]

Background

Despite continuing technological advances, the cost for large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to drop the amount of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, such as application in wildlife monitoring, plant and animal breeding, but it is in essence species-agnostic. The proposed approach consists in grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and a local likelihood-aware phasing algorithm closely modeled on MaCH that we implemented.

Results

We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thus it outperforms the Beagle imputation model which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.

Conclusions

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.
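
The pooling step described in the Background can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a square block of samples pooled once per row and once per column, a three-state pool read-out (0 = only reference alleles seen, 2 = only alternate alleles seen, 1 = both), and the function names. The deterministic decoding rule below is also a simplification that stands in for the statistical estimation algorithm mentioned in the abstract: it resolves only the samples whose genotype is forced by a homozygous pool and leaves the rest undetermined for a later imputation step.

```python
import numpy as np

def pool_readouts(block):
    """Simulate row and column pool read-outs for a square block of genotypes (0, 1, 2)."""
    def readout(group):
        if np.all(group == 0):
            return 0   # only the reference allele is present in the pool
        if np.all(group == 2):
            return 2   # only the alternate allele is present in the pool
        return 1       # both alleles are present; the pool is ambiguous

    rows = np.array([readout(r) for r in block])
    cols = np.array([readout(c) for c in block.T])
    return rows, cols

def decode(rows, cols):
    """Resolve genotypes that are forced by a homozygous pool; -1 marks undetermined."""
    decoded = np.full((len(rows), len(cols)), -1)
    for i, row_state in enumerate(rows):
        for j, col_state in enumerate(cols):
            for state in (row_state, col_state):
                if state in (0, 2):
                    # Every sample in a homozygous pool shares that genotype.
                    decoded[i, j] = state
    return decoded

# 16 samples tested with 8 pools (4 row pools + 4 column pools) at one marker.
block = np.zeros((4, 4), dtype=int)
block[1, 2] = 1  # a single heterozygous carrier
rows, cols = pool_readouts(block)
print(decode(rows, cols))
```

With one carrier in the block, every other sample sits in at least one pool that shows only the reference allele, so 15 of the 16 genotypes are recovered from 8 assays. The carrier itself remains undetermined, which is the kind of gap that imputation is then asked to fill.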

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Pooling, Imputation, Genotyping
National subject category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-486864 (URN); 10.1186/s12859-022-04974-7 (DOI); 000867656900001 (); 36229780 (PubMedID)
Project
eSSENCE - An eScience Collaboration
Funder
Forskningsrådet Formas, 2017-00453; Swedish National Infrastructure for Computing (SNIC), 2019/8-216; Swedish National Infrastructure for Computing (SNIC), 2020/5-455; Uppsala University
Available from: 2022-10-18. Created: 2022-10-18. Last updated: 2024-01-17. Bibliographically approved
Ausmees, K., Sanchez-Quinto, F., Jakobsson, M. & Nettelblad, C. (2022). An empirical evaluation of genotype imputation of ancient DNA. G3: Genes, Genomes, Genetics, 12(6), Article ID jkac089.
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no. 6, article ID jkac089. Journal article (Refereed). Published
Abstract [en]

With capabilities of sequencing ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes presents a possibility to increase the power of inference as well as cost-effectiveness for the analysis of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle v4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference, and study sample size. Making use of five ancient individuals with high-coverage data available, we evaluated imputed data for accuracy, reference bias, and genetic affinities as captured by principal component analysis. We obtained genotype concordance levels of over 99% for data with 1× coverage, and similar levels of accuracy and reference bias at levels as low as 0.75×. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1×. We also show that a large and varied phased reference panel as well as the inclusion of low- to moderate-coverage ancient individuals in the study sample can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for postprocessing and validation prior to downstream analysis.
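
Genotype concordance, the accuracy measure quoted above, can be computed as the fraction of sites where the imputed genotype equals the genotype called from high-coverage data. The sketch below is one simple way to do this and is not the evaluation code used in the study; the genotype coding (0, 1, 2 alternate alleles), the `-1` missing-value marker and the function name are assumptions.

```python
import numpy as np

def genotype_concordance(truth, imputed, missing=-1):
    """Fraction of sites with identical genotypes, over sites called in both inputs."""
    truth = np.asarray(truth)
    imputed = np.asarray(imputed)
    if truth.shape != imputed.shape:
        raise ValueError("truth and imputed must have the same shape")
    # Only compare sites where both the reference call and the imputed call exist.
    called = (truth != missing) & (imputed != missing)
    if not called.any():
        raise ValueError("no sites are called in both inputs")
    return float(np.mean(truth[called] == imputed[called]))

truth = [0, 1, 2, 1, -1]
imputed = [0, 1, 1, 1, 2]
print(genotype_concordance(truth, imputed))  # 3 of 4 comparable sites agree: 0.75
```

Because most sites in a typical data set are homozygous for the reference allele, overall concordance can look high even when rare variants are imputed poorly, so it is often reported separately per genotype class or allele-frequency bin.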

Place, publisher, year, edition, pages
Oxford University Press, 2022
National subject category
Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-396336 (URN); 10.1093/g3journal/jkac089 (DOI); 000791204600001 (); 35482488 (PubMedID)
Project
eSSENCE
Funder
Forskningsrådet Formas, 2020-00712
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. (2022). Methodology and Infrastructure for Statistical Computing in Genomics: Applications for Ancient DNA. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis concerns the development and evaluation of computational methods for analysis of genetic data. A particular focus is on ancient DNA recovered from archaeological finds, the analysis of which has contributed to novel insights into human evolutionary and demographic history, while also introducing new challenges and the demand for specialized methods.

A main topic is that of imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also develop a tool for genotype imputation that is based on the full probabilistic Li and Stephens model for haplotype frequencies and show that it can yield improved accuracy on particularly challenging data.  

Another central subject in genomics and population genetics is that of data characterization methods that allow for visualization and exploratory analysis of complex information. We discuss challenges associated with performing dimensionality reduction of genetic data, demonstrating how the use of principal component analysis is sensitive to incomplete information and performing an evaluation of methods to handle unobserved genotypes. We also discuss the use of deep learning models as an alternative to traditional methods of data characterization in genomics and propose a framework based on convolutional autoencoders that we exemplify on the applications of dimensionality reduction and genetic clustering.

In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The final part of this thesis addresses the use of cloud resources for facilitating data analysis in scientific applications. We present two different cloud-based solutions, and exemplify them on applications from genomics.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2022. p. 53
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2129
Keywords
statistical computing, genotype imputation, ancient DNA, deep learning, dimensionality reduction, genetic clustering, distributed computing
National subject category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics; Software Engineering
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470703 (URN); 978-91-513-1457-0 (ISBN)
Public defence
2022-06-08, 101121, Lägerhyddsvägen 1, Uppsala, 10:15 (English)
Opponent
Supervisors
Project
eSSENCE
Available from: 2022-05-17. Created: 2022-03-28. Last updated: 2025-02-01
John, A., Muenzen, K. & Ausmees, K. (2021). Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLOS ONE, 16(7), Article ID e0254363.
2021 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 16, no. 7, article ID e0254363. Journal article (Refereed). Published
Abstract [en]

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.

Place, publisher, year, edition, pages
Public Library of Science (PLoS), 2021
National subject category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:uu:diva-452320 (URN); 10.1371/journal.pone.0254363 (DOI); 000674301400079 (); 34242357 (PubMedID)
Available from: 2021-09-06. Created: 2021-09-06. Last updated: 2024-01-15. Bibliographically approved
Ausmees, K. (2019). Efficient computational methods for applications in genomics. (Licentiate dissertation). Uppsala University
2019 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

During the last two decades, advances in molecular technology have facilitated the sequencing and analysis of ancient DNA recovered from archaeological finds, contributing to novel insights into human evolutionary history. As more ancient genetic information has become available, the need for specialized methods of analysis has also increased. In this thesis, we investigate statistical and computational models for analysis of genetic data, with a particular focus on the context of ancient DNA.

The main focus is on imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also discuss preliminary results from a simulation study comparing two methods of phasing and imputation, which suggest that the parametric Li and Stephens framework may be more robust to extremely low levels of sparsity than the parsimonious Browning and Browning model.

An evaluation of methods to handle missing data in the application of PCA for dimensionality reduction of genotype data is also presented. We illustrate that non-overlapping sequence data can lead to artifacts in projected scores, and evaluate different methods for handling unobserved genotypes.
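
One way to see how unobserved genotypes affect projected PCA scores is sketched below. It is a simplified illustration and not the evaluation carried out in the thesis: it assumes principal components fitted on a fully observed reference panel, a sample whose missing genotypes are coded as NaN, and mean filling as the handling method. The function names are hypothetical.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Fit principal components on a fully observed (samples x markers) genotype matrix."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centred reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project_with_mean_fill(sample, mean, components):
    """Project one sample, replacing missing genotypes (NaN) with the reference mean."""
    filled = np.where(np.isnan(sample), mean, sample)
    # A mean-filled marker is zero after centring, so it adds nothing to the score.
    return (filled - mean) @ components.T

rng = np.random.default_rng(0)
reference = rng.integers(0, 3, size=(20, 10)).astype(float)
mean, components = fit_pca(reference)

complete = reference[0].copy()
sparse = complete.copy()
sparse[:7] = np.nan  # most markers unobserved

print(project_with_mean_fill(complete, mean, components))
print(project_with_mean_fill(sparse, mean, components))
```

Since each mean-filled marker contributes zero to the score, a sample with no observed markers lands exactly at the origin, and sparsely observed samples are placed using only the few markers they have. This is one mechanism by which missingness, rather than ancestry, can shape where samples fall in the plot.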

In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The last part of this thesis addresses the use of cloud resources for facilitating such analysis. We present two different cloud-based solutions, and exemplify them on applications from genomics.

Place, publisher, year, edition, pages
Uppsala University, 2019
Series
IT licentiate theses / Uppsala University, Department of Information Technology, ISSN 1404-5117 ; 2019-006
National subject category
Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-396409 (URN)
Supervisors
Project
eSSENCE
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
Ausmees, K. (2019). Evaluation of methods handling missing data in PCA on genotype data: Applications for ancient DNA.
2019 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2019-009
National subject category
Computational Mathematics; Genetics and Genomics
Identifiers
urn:nbn:se:uu:diva-396346 (URN)
Project
eSSENCE
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved
John, A., Ausmees, K., Muenzen, K., Kuhn, C. & Tan, A. (2019). SWEEP: Accelerating scientific research through scalable serverless workflows. In: Companion Proc. 12th International Conference on Utility and Cloud Computing. Paper presented at UCC 2019 (pp. 43-50). New York: ACM Press
2019 (English). In: Companion Proc. 12th International Conference on Utility and Cloud Computing, New York: ACM Press, 2019, pp. 43-50. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2019
National subject category
Software Engineering
Identifiers
urn:nbn:se:uu:diva-396405 (URN); 10.1145/3368235.3368839 (DOI); 978-1-4503-7044-8 (ISBN)
Conference
UCC 2019
Project
eSSENCE
Available from: 2019-12-02. Created: 2019-11-04. Last updated: 2022-03-28. Bibliographically approved
Ausmees, K., John, A., Toor, S. Z., Hellander, A. & Nettelblad, C. (2018). BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data. BMC Bioinformatics, 19, 240:1-11, Article ID 240.
2018 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 19, pp. 240:1-11, article ID 240. Journal article (Refereed). Published
National subject category
Software Engineering; Genetics and Genomics
Identifiers
urn:nbn:se:uu:diva-360033 (URN); 10.1186/s12859-018-2241-z (DOI); 000436517200001 (); 29940842 (PubMedID)
Project
eSSENCE
Available from: 2018-06-26. Created: 2018-09-09. Last updated: 2025-02-01. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-6212-539x
