A deep learning framework for characterization of genotype data
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Applied Scientific Computing. Uppsala University, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-6212-539x
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Applied Scientific Computing. Uppsala University, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0003-0458-6902
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no. 3, article id jkac020. Article in journal (Refereed). Published
Abstract [en]

Dimensionality reduction is a data transformation technique widely used in various fields of genomics research. The application of dimensionality reduction to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. Among frequently used methods are principal component analysis, which is a linear transform that often misses more fine-scale structures, and neighbor-graph based methods which focus on local relationships rather than large-scale patterns. Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this study, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, while preserving global geometry to a higher extent than t-SNE and UMAP, yielding results that are comparable to an alternative deep learning approach based on variational autoencoders. We also discuss the use of the methodology for more general characterization of genotype data, showing that it preserves spatial properties in the form of decay of linkage disequilibrium with distance along the genome and demonstrating its use as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.
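The abstract refers to the decay of linkage disequilibrium (LD) with distance along the genome. As a rough illustration of what that quantity looks like, the following minimal sketch computes the squared Pearson correlation (r^2) between pairs of SNPs and averages it per distance bin. It is not the authors' implementation: the function names, the 0/1/2 dosage encoding, the toy data and the bin size are all assumptions made for this example.

```python
from itertools import combinations


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic SNP: the correlation is undefined, so report 0 here.
        return 0.0
    return cov * cov / (var_x * var_y)


def ld_by_distance(genotypes, positions, bin_size):
    """Average r^2 over all SNP pairs, grouped by binned base-pair distance.

    genotypes: list of individuals, each a list of dosages (0, 1 or 2) per SNP.
    positions: base-pair position of each SNP, in the same column order.
    """
    columns = list(zip(*genotypes))  # one tuple of dosages per SNP
    sums, counts = {}, {}
    for i, j in combinations(range(len(positions)), 2):
        distance_bin = abs(positions[j] - positions[i]) // bin_size
        sums[distance_bin] = sums.get(distance_bin, 0.0) + pearson_r2(columns[i], columns[j])
        counts[distance_bin] = counts.get(distance_bin, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy data (hypothetical): 4 individuals, 3 SNPs; the first two SNPs are close
# together and perfectly correlated, the third lies far away.
toy_genotypes = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
]
toy_positions = [100, 150, 5000]
ld = ld_by_distance(toy_genotypes, toy_positions, bin_size=1000)
```

In this toy case the nearby pair falls in bin 0 and the two distant pairs in bin 4, so `ld` maps each bin to its mean r^2. The same curve could in principle be computed on original and reconstructed genotypes to compare how well a model preserves LD structure, although the abstract does not specify how the authors do this.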

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2022. Vol. 12, no. 3, article id jkac020
Keywords [en]
deep learning, convolutional autoencoder, dimensionality reduction, genetic clustering, population genetics
National subject category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
URN: urn:nbn:se:uu:diva-470290
DOI: 10.1093/g3journal/jkac020
ISI: 000776673300018
PubMedID: 35078229
OAI: oai:DiVA.org:uu-470290
DiVA, id: diva2:1646307
Project
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453; Swedish Research Council Formas, 2020-00712
Available from: 2022-03-22 Created: 2022-03-22 Last updated: 2025-02-01 Bibliographically approved
Part of thesis
1. Methodology and Infrastructure for Statistical Computing in Genomics: Applications for Ancient DNA
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis concerns the development and evaluation of computational methods for analysis of genetic data. A particular focus is on ancient DNA recovered from archaeological finds, the analysis of which has contributed to novel insights into human evolutionary and demographic history, while also introducing new challenges and the demand for specialized methods.

A main topic is that of imputation, or the inference of missing genotypes based on observed sequence data. We present results from a systematic evaluation of a common imputation pipeline on empirical ancient samples, and show that imputed data can constitute a realistic option for population-genetic analyses. We also develop a tool for genotype imputation that is based on the full probabilistic Li and Stephens model for haplotype frequencies and show that it can yield improved accuracy on particularly challenging data.  

Another central subject in genomics and population genetics is that of data characterization methods that allow for visualization and exploratory analysis of complex information. We discuss challenges associated with performing dimensionality reduction of genetic data, demonstrating how the use of principal component analysis is sensitive to incomplete information and performing an evaluation of methods to handle unobserved genotypes. We also discuss the use of deep learning models as an alternative to traditional methods of data characterization in genomics and propose a framework based on convolutional autoencoders that we exemplify on the applications of dimensionality reduction and genetic clustering.

In genomics, as in other fields of research, increasing sizes of data sets are placing larger demands on efficient data management and compute infrastructures. The final part of this thesis addresses the use of cloud resources for facilitating data analysis in scientific applications. We present two different cloud-based solutions, and exemplify them on applications from genomics.
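The thesis abstract above notes that principal component analysis is sensitive to incomplete information and that methods for handling unobserved genotypes were evaluated. The abstract does not say which methods these were. Purely as an illustration of the problem, the sketch below shows one simple and widely used strategy, replacing each missing genotype with the mean dosage of the observed values at that SNP. The function name, the use of `None` as the missing marker and the toy data are assumptions for this example.

```python
def impute_snp_means(genotypes):
    """Fill missing genotypes (None) with the per-SNP mean of observed dosages.

    genotypes: list of individuals, each a list of dosages (0, 1, 2) or None.
    Returns a new matrix; the input is left unchanged.
    """
    n_snps = len(genotypes[0])
    means = []
    for j in range(n_snps):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # If a SNP has no observed values at all, fall back to 0.0 (assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_snps)]
        for row in genotypes
    ]


# Toy data (hypothetical): 3 individuals, 2 SNPs, two missing calls.
toy = [
    [0, None],
    [2, 1],
    [None, 1],
]
filled = impute_snp_means(toy)
```

Mean imputation places individuals with many missing calls near the centre of the data, which is one way incomplete genotypes can distort a PCA projection.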

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2022. p. 53
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2129
Keywords
statistical computing, genotype imputation, ancient DNA, deep learning, dimensionality reduction, genetic clustering, distributed computing
National subject category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics; Software Engineering
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470703 (URN)
978-91-513-1457-0 (ISBN)
Public defence
2022-06-08, 101121, Lägerhyddsvägen 1, Uppsala, 10:15 (English)
Opponent
Supervisors
Project
eSSENCE
Available from: 2022-05-17 Created: 2022-03-28 Last updated: 2025-02-01

Open Access in DiVA

fulltext (1500 kB), 744 downloads
File information
File name: FULLTEXT01.pdf
File size: 1500 kB
Checksum (SHA-512):
420972a7711af712c3ad0c2af366f88f8dde2315215c3040bff54e3bef27bd440f69cf23ede95ad2547f97166b8a59347be71732200406c41cd1fd1c23cee335
Type: fulltext
Mimetype: application/pdf
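The SHA-512 checksum listed in the file information can be used to check that a downloaded copy of the full text is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the local file path are assumptions for illustration, and the expected value is copied from the record above.

```python
import hashlib

# Checksum as listed in the record above, split over two lines for readability.
EXPECTED_SHA512 = (
    "420972a7711af712c3ad0c2af366f88f8dde2315215c3040bff54e3bef27bd44"
    "0f69cf23ede95ad2547f97166b8a59347be71732200406c41cd1fd1c23cee335"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_SHA512):
    """True if the file's SHA-512 digest equals the expected hex string."""
    return sha512_of_file(path) == expected.lower()
```

For example, `matches_record("FULLTEXT01.pdf")` would return `True` only if a file of that name in the working directory hashes to the listed value.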

Other links

Publisher's full text
PubMed

Authors

Ausmees, Kristiina
Nettelblad, Carl
