Publications from Uppsala University

Publications (10 of 50)
Clouard, C. & Nettelblad, C. (2025). Using feedback in pooled experiments augmented with imputation for high genotyping accuracy at reduced cost. G3: Genes, Genomes, Genetics, 15(3), Article ID jkaf010.
Using feedback in pooled experiments augmented with imputation for high genotyping accuracy at reduced cost
2025 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 15, no 3, article id jkaf010. Article in journal (Refereed), Published
Abstract [en]

Conducting genomic selection in plant breeding programs can substantially speed up the development of new varieties. Genomic selection provides more reliable insights when it is based on dense marker data, in which the rare variants can be particularly informative. Despite the availability of new technologies, the cost of large-scale genotyping remains a major limitation to the implementation of genomic selection. We suggest combining pooled genotyping with population-based imputation as a cost-effective computational strategy for genotyping SNPs. Pooling saves genotyping tests and has proven to accurately capture the rare variants that are usually missed by imputation. In this study, we investigate adding iterative coupling to a joint model of pooling and imputation that we have previously proposed. In each iteration, the imputed genotype probabilities serve as feedback input for adjusting the per-sample prior genotype probabilities, before running a new imputation based on these adjusted data. This flexible setup indirectly imposes consistency between the imputed genotypes and the pooled observations. We demonstrate that repeated cycles of feedback can take advantage of the strengths in both pooling and imputation when an appropriate set of reference haplotypes is available for imputation. The iterations improve greatly upon the initial genotype predictions, achieving very high genotype accuracy for both low- and high-frequency variants. We enhance the average concordance from 94.5% to 98.4% at limited computational cost and without requiring any additional genotype testing.
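The feedback cycle described in the abstract can be sketched as follows. This is an illustrative simplification, not the authors' code: the names `feedback_imputation`, `impute`, `pool_mask`, and `rate` are hypothetical, and `impute` stands in for any population-based imputation step returning posterior genotype probabilities.

```python
import numpy as np

def feedback_imputation(priors, impute, pool_mask, n_iter=5, rate=0.5):
    """priors: (n_samples, 3) genotype probabilities for one marker.
    impute: callable mapping current probabilities to imputed posteriors.
    pool_mask: True where the pooling design left the genotype ambiguous."""
    probs = priors.copy()
    for _ in range(n_iter):
        posteriors = impute(probs)
        # Adjust the priors only for ambiguous samples; genotypes already
        # resolved by the pooled observations stay fixed, which indirectly
        # keeps the imputed genotypes consistent with the pooled data.
        probs[pool_mask] = (1 - rate) * probs[pool_mask] + rate * posteriors[pool_mask]
        probs /= probs.sum(axis=1, keepdims=True)  # renormalize each sample
    return probs
```

In each pass the posterior is blended into the prior of unresolved samples only, so repeated cycles pull the predictions toward the imputation model without overriding what the pooled tests already determined.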

Place, publisher, year, edition, pages
Oxford University Press, 2025
Keywords
SNP array, pooling, imputation, iterative refinement
National Category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-518429 (URN)
10.1093/g3journal/jkaf010 (DOI)
001424950400001 ()
39847531 (PubMedID)
2-s2.0-105001185182 (Scopus ID)
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2023-12-19. Created: 2023-12-19. Last updated: 2025-06-18. Bibliographically approved.
Gonthier, M., Larsson, E., Marchal, L., Nettelblad, C. & Thibault, S. (2024). Data-Driven Locality-Aware Batch Scheduling. In: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Paper presented at 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 27-31 May, 2024, San Francisco, CA, USA (pp. 202-211). Institute of Electrical and Electronics Engineers (IEEE)
Data-Driven Locality-Aware Batch Scheduling
2024 (English). In: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 202-211. Conference paper, Published paper (Refereed)
Abstract [en]

Clusters employ workload schedulers such as the Slurm Workload Manager to allocate computing jobs onto nodes. These schedulers usually aim at a good trade-off between increasing resource utilization and user satisfaction (decreasing job waiting time). However, these schedulers are typically unaware of jobs sharing large input files, which may happen in data-intensive scenarios. The same input files may end up being loaded several times, leading to a waste of resources. We study how to design a data-aware job scheduler that is able to keep large input files on the computing nodes, without impacting other memory needs, and can benefit from previously loaded files to decrease data transfers in order to reduce the waiting times of jobs. We present three schedulers capable of distributing the load between the computing nodes as well as reusing input files already loaded in the memory of some node as much as possible. We perform simulations with single-node jobs using traces of real HPC-cluster usage to compare them to classical job schedulers. The results show that keeping data in local memory between successive jobs and using data-locality information to schedule jobs improves performance compared to a widely used scheduler (FCFS, with and without backfilling): a reduction in job waiting time (a 7.5% improvement in stretch) and a decrease in the amount of data transfers (7%).
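The data-locality heuristic at the heart of the abstract can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's schedulers: `pick_node`, `job_inputs`, and `node_cache` are hypothetical names.

```python
def pick_node(job_inputs, node_cache):
    """Prefer the node whose memory already caches the largest share (by bytes)
    of the job's input files, avoiding redundant loads of shared inputs.
    job_inputs: {filename: size_in_bytes}; node_cache: {node: set of cached files}."""
    def cached_bytes(node):
        return sum(size for f, size in job_inputs.items() if f in node_cache[node])
    # Ties fall back to dictionary order, i.e. an arbitrary but stable choice.
    return max(node_cache, key=cached_bytes)
```

A real scheduler would combine this locality score with load balancing across nodes and with eviction of cached files when memory runs short, as the abstract notes.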

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Batch scheduling, Job input sharing, Data aware, Job scheduling, High Performance Data Analytics
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-539007 (URN)
10.1109/IPDPSW63119.2024.00058 (DOI)
001284697300050 ()
979-8-3503-6461-3 (ISBN)
979-8-3503-6460-6 (ISBN)
Conference
2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 27-31 May, 2024, San Francisco, CA, USA
Available from: 2024-09-23. Created: 2024-09-23. Last updated: 2024-09-23. Bibliographically approved.
Clouard, C. & Nettelblad, C. (2024). Genotyping of SNPs in bread wheat at reduced cost from pooled experiments and imputation. Theoretical and Applied Genetics, 137(1), Article ID 26.
Genotyping of SNPs in bread wheat at reduced cost from pooled experiments and imputation
2024 (English). In: Theoretical and Applied Genetics, ISSN 0040-5752, E-ISSN 1432-2242, Vol. 137, no 1, article id 26. Article in journal (Refereed), Published
Abstract [en]

The plant breeding industry has shown growing interest in using the genotype data of relevant markers for performing selection of new competitive varieties. The selection usually benefits from large amounts of marker data, and it is therefore crucial to have access to data collection methods that are both cost-effective and reliable. Computational methods such as genotype imputation have been proposed earlier in several plant science studies for addressing the cost challenge. Genotype imputation methods have, however, been used more frequently and investigated more extensively in human genetics research. The various algorithms that exist have shown lower accuracy at inferring the genotype of genetic variants occurring at low frequency, while these rare variants can have great significance and impact in the genetic studies that underlie selection. In contrast, pooling is a technique that can efficiently identify low-frequency items in a population, and it has been successfully used for detecting the samples that carry rare variants in a population. In this study, we propose to combine pooling and imputation, and demonstrate this by simulating a hypothetical microarray for genotyping a population of recombinant inbred lines in a cost-effective and accurate manner, even for rare variants. We show that with an adequate imputation model, it is feasible to accurately predict the individual genotypes at lower cost than sample-wise genotyping, and in a time-effective manner. Moreover, we provide code resources for reproducing the results presented in this study in the form of a containerized workflow.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
genotyping, imputation, MAGIC population, pooling, wheat
National Category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-518436 (URN)
10.1007/s00122-023-04533-5 (DOI)
001145311600001 ()
38243086 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2024-01-10. Created: 2024-01-10. Last updated: 2025-01-07. Bibliographically approved.
Ekeberg, T., Assalauova, D., Bielecki, J., Boll, R., Daurer, B. J., Eichacker, L. A., . . . Maia, F. R. N. (2024). Observation of a single protein by ultrafast X-ray diffraction. Light: Science & Applications, 13(1), Article ID 15.
Observation of a single protein by ultrafast X-ray diffraction
2024 (English). In: Light: Science & Applications, ISSN 2095-5545, E-ISSN 2047-7538, Vol. 13, no 1, article id 15. Article in journal (Refereed), Published
Abstract [en]

The idea of using ultrashort X-ray pulses to obtain images of single proteins frozen in time has fascinated and inspired many. It was one of the arguments for building X-ray free-electron lasers. According to theory, the extremely intense pulses provide sufficient signal to dispense with using crystals as an amplifier, and the ultrashort pulse duration permits capturing the diffraction data before the sample inevitably explodes. This was first demonstrated on biological samples a decade ago, on the giant mimivirus. Since then, a large collaboration has been pushing the limit of the smallest sample that can be imaged. The ability to capture snapshots on the timescale of atomic vibrations, while keeping the sample at room temperature, may allow probing the entire conformational phase space of macromolecules. Here we show the first observation of an X-ray diffraction pattern from a single protein, that of Escherichia coli GroEL, which at 14 nm in diameter is the smallest biological sample ever imaged by X-rays, and demonstrate that the concept of diffraction before destruction extends to single proteins. From the pattern, it is possible to determine the approximate orientation of the protein. Our experiment demonstrates the feasibility of ultrafast imaging of single proteins, opening the way to single-molecule time-resolved studies on the femtosecond timescale.

Place, publisher, year, edition, pages
Springer Nature, 2024
National Category
Biophysics
Identifiers
urn:nbn:se:uu:diva-520488 (URN)
10.1038/s41377-023-01352-7 (DOI)
001142025600001 ()
38216563 (PubMedID)
Funder
German Research Foundation (DFG), 152/772-1
German Research Foundation (DFG), 152/774-1
German Research Foundation (DFG), 152/775-1
German Research Foundation (DFG), 152/776-1
German Research Foundation (DFG), 152/777-1
German Research Foundation (DFG), 390715994
EU, European Research Council, 614507
European Regional Development Fund (ERDF), CZ.02.1.01/0.0/0.0/15_003/0000447
Swedish Research Council, 2017-05336
Swedish Research Council, 2018-00234
Swedish Research Council, 2019-03935
Swedish Foundation for Strategic Research, ITM17-0455
Available from: 2024-01-12. Created: 2024-01-12. Last updated: 2025-02-20. Bibliographically approved.
Suazo, M., Zackrisson, E., Mahto, P., Lundell, F., Nettelblad, C., Korn, A. J., . . . Majumdar, S. (2024). Project Hephaistos – II. Dyson sphere candidates from Gaia DR3, 2MASS, and WISE. Monthly notices of the Royal Astronomical Society, 531(1), 695-707
Project Hephaistos – II. Dyson sphere candidates from Gaia DR3, 2MASS, and WISE
2024 (English). In: Monthly notices of the Royal Astronomical Society, ISSN 0035-8711, E-ISSN 1365-2966, Vol. 531, no 1, p. 695-707. Article in journal (Refereed), Published
Abstract [en]

The search for extraterrestrial intelligence is currently being pursued using multiple techniques and in different wavelength bands. Dyson spheres, megastructures that could be constructed by advanced civilizations to harness the radiation energy of their host stars, represent a potential technosignature that, in principle, may be hiding in public data already collected as part of large astronomical surveys. In this study, we present a comprehensive search for partial Dyson spheres by analysing optical and infrared observations from Gaia, 2MASS, and WISE. We develop a pipeline that employs multiple filters to identify potential candidates and reject interlopers in a sample of five million objects, and which incorporates a convolutional neural network to help identify confusion in WISE data. Finally, the pipeline identifies seven candidates deserving of further analysis. All of these objects are M-dwarfs, for which astrophysical phenomena cannot easily account for the observed infrared excess emission.

Place, publisher, year, edition, pages
Oxford University Press, 2024
Keywords
Extraterrestrial intelligence, infrared:stars, stars:low-mass
National Category
Astronomy, Astrophysics and Cosmology
Research subject
Astronomy and Astrophysics; Astronomy
Identifiers
urn:nbn:se:uu:diva-525985 (URN)
10.1093/mnras/stae1186 (DOI)
001224506300003 ()
Projects
eSSENCE - An eScience Collaboration
Note

Paper submitted to Monthly Notices of the Royal Astronomical Society

Available from: 2024-04-02. Created: 2024-04-02. Last updated: 2025-01-07. Bibliographically approved.
Ausmees, K. & Nettelblad, C. (2023). Achieving improved accuracy for imputation of ancient DNA. Bioinformatics, 39(1), Article ID btac738.
Achieving improved accuracy for imputation of ancient DNA
2023 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, no 1, article id btac738. Article in journal (Refereed), Published
Abstract [en]

Motivation

Genotype imputation has the potential to increase the amount of information that can be gained from the often limited biological material available in ancient samples. As many widely used tools have been developed with modern data in mind, their design is not necessarily reflective of the requirements in studies of ancient DNA. Here, we investigate if an imputation method based on the full probabilistic Li and Stephens model of haplotype frequencies might be beneficial for the particular challenges posed by ancient data.

Results

We present an implementation called prophaser and compare its imputation performance to two alternative Beagle-based pipelines that have been used in the ancient DNA community. Considering empirical ancient data downsampled to lower coverages, as well as present-day samples with artificially thinned genotypes, we show that the proposed method is advantageous at lower coverages, where it yields improved accuracy and ability to capture rare variation. The software prophaser is optimized for running in a massively parallel manner and achieved reasonable runtimes in the experiments performed when executed on a GPU.

Place, publisher, year, edition, pages
Oxford University Press, 2023
Keywords
Bioinformatics, Computational biology, Genotype imputation, Ancient DNA
National Category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics
Research subject
Bioinformatics; Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470292 (URN)
10.1093/bioinformatics/btac738 (DOI)
000892594400001 ()
36377787 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved.
Ausmees, K. & Nettelblad, C. (2022). A deep learning framework for characterization of genotype data. G3: Genes, Genomes, Genetics, 12(3), Article ID jkac020.
A deep learning framework for characterization of genotype data
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no 3, article id jkac020. Article in journal (Refereed), Published
Abstract [en]

Dimensionality reduction is a data transformation technique widely used in various fields of genomics research. The application of dimensionality reduction to genotype data is known to capture genetic similarity between individuals, and is used for visualization of genetic variation, identification of population structure as well as ancestry mapping. Among frequently used methods are principal component analysis, which is a linear transform that often misses more fine-scale structures, and neighbor-graph based methods which focus on local relationships rather than large-scale patterns. Deep learning models are a type of nonlinear machine learning method in which the features used in data transformation are decided by the model in a data-driven manner, rather than by the researcher, and have been shown to present a promising alternative to traditional statistical methods for various applications in omics research. In this study, we propose a deep learning model based on a convolutional autoencoder architecture for dimensionality reduction of genotype data. Using a highly diverse cohort of human samples, we demonstrate that the model can identify population clusters and provide richer visual information in comparison to principal component analysis, while preserving global geometry to a higher extent than t-SNE and UMAP, yielding results that are comparable to an alternative deep learning approach based on variational autoencoders. We also discuss the use of the methodology for more general characterization of genotype data, showing that it preserves spatial properties in the form of decay of linkage disequilibrium with distance along the genome and demonstrating its use as a genetic clustering method, comparing results to the ADMIXTURE software frequently used in population genetic studies.

Place, publisher, year, edition, pages
Oxford University Press (OUP), 2022
Keywords
deep learning, convolutional autoencoder, dimensionality reduction, genetic clustering, population genetics
National Category
Bioinformatics (Computational Biology); Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-470290 (URN)
10.1093/g3journal/jkac020 (DOI)
000776673300018 ()
35078229 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Swedish Research Council Formas, 2020-00712
Available from: 2022-03-22. Created: 2022-03-22. Last updated: 2025-02-01. Bibliographically approved.
Clouard, C., Ausmees, K. & Nettelblad, C. (2022). A joint use of pooling and imputation for genotyping SNPs. BMC Bioinformatics, 23, Article ID 421.
A joint use of pooling and imputation for genotyping SNPs
2022 (English). In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 23, article id 421. Article in journal (Refereed), Published
Abstract [en]

Background

Despite continuing technological advances, the cost of large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, with applications in wildlife monitoring and plant and animal breeding, but the approach is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and with a local likelihood-aware phasing algorithm, closely modeled on MaCH, that we implemented.

Results

We conduct simulations based on human data from the 1000 Genomes Project, to aid comparison with other imputation studies. Based on the simulated data, we find that pooling impacts the genotype frequencies of the directly identifiable markers, without imputation. We also demonstrate how a combinatorial estimation of the genotype probabilities from the pooling design can improve the prediction performance of imputation models. Our algorithm achieves 93% concordance in predicting unassayed markers from pooled data, thus it outperforms the Beagle imputation model which reaches 80% concordance. We observe that the pooling design gives higher concordance for the rare variants than traditional low-density to high-density imputation commonly used for cost-effective genotyping of large cohorts.

Conclusions

We present promising results for combining a pooling scheme for SNP genotyping with computational genotype imputation on human data. These results could find potential applications in any context where the genotyping costs form a limiting factor on the study size, such as in marker-assisted selection in plant breeding.
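The group-testing logic underlying the pooling scheme can be illustrated with a toy non-overlapping row/column design. Everything below is an illustrative simplification, not the paper's design or its statistical estimation algorithm: samples are laid out on a grid, each row pool and column pool reports whether it contains at least one carrier of the alternate allele, and only some samples can be decoded with certainty.

```python
import numpy as np

def decode_pools(carriers):
    """carriers: boolean (rows, cols) grid of true carrier status, used here to
    simulate pool test outcomes. Returns a grid with 1 = decoded carrier,
    0 = decoded non-carrier, -1 = ambiguous (left for statistical inference)."""
    row_pos = carriers.any(axis=1)  # does each row pool test positive?
    col_pos = carriers.any(axis=0)  # does each column pool test positive?
    decoded = np.full(carriers.shape, -1)
    # Any sample in a negative row or column pool is certainly a non-carrier.
    decoded[~row_pos, :] = 0
    decoded[:, ~col_pos] = 0
    # With a single positive column, every positive row's carrier must lie in
    # it (and symmetrically for a single positive row); other intersections of
    # positive pools remain ambiguous.
    if col_pos.sum() == 1:
        decoded[row_pos, np.where(col_pos)[0][0]] = 1
    if row_pos.sum() == 1:
        decoded[np.where(row_pos)[0][0], col_pos] = 1
    return decoded
```

The ambiguous (-1) cells are exactly the missing items the abstract refers to: resolving them per marker, given the pool outcomes and population allele frequencies, is where the statistical estimation and imputation come in.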

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Pooling, Imputation, Genotyping
National Category
Bioinformatics (Computational Biology)
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-486864 (URN)
10.1186/s12859-022-04974-7 (DOI)
000867656900001 ()
36229780 (PubMedID)
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Research Council Formas, 2017-00453
Swedish National Infrastructure for Computing (SNIC), 2019/8-216
Swedish National Infrastructure for Computing (SNIC), 2020/5-455
Uppsala University
Available from: 2022-10-18. Created: 2022-10-18. Last updated: 2024-01-17. Bibliographically approved.
Ausmees, K., Sanchez-Quinto, F., Jakobsson, M. & Nettelblad, C. (2022). An empirical evaluation of genotype imputation of ancient DNA. G3: Genes, Genomes, Genetics, 12(6), Article ID jkac089.
An empirical evaluation of genotype imputation of ancient DNA
2022 (English). In: G3: Genes, Genomes, Genetics, E-ISSN 2160-1836, Vol. 12, no 6, article id jkac089. Article in journal (Refereed), Published
Abstract [en]

With capabilities of sequencing ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes presents a possibility to increase the power of inference as well as cost-effectiveness for the analysis of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle v4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference, and study sample size. Making use of five ancient individuals with high-coverage data available, we evaluated imputed data for accuracy, reference bias, and genetic affinities as captured by principal component analysis. We obtained genotype concordance levels of over 99% for data with 1× coverage, and similar levels of accuracy and reference bias at levels as low as 0.75×. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1×. We also show that a large and varied phased reference panel as well as the inclusion of low- to moderate-coverage ancient individuals in the study sample can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for postprocessing and validation prior to downstream analysis.
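Genotype concordance, the accuracy measure quoted in this abstract (over 99% at 1× coverage), is simply the fraction of markers where the imputed call matches the truth from high-coverage data. A minimal sketch (illustrative code, not the study's pipeline):

```python
def concordance(truth, imputed):
    """Fraction of markers where the imputed genotype (0/1/2 alternate-allele
    dosage) matches the true genotype."""
    assert len(truth) == len(imputed) and len(truth) > 0
    return sum(t == i for t, i in zip(truth, imputed)) / len(truth)
```

In practice such evaluations are often stratified by allele frequency, since aggregate concordance is dominated by common variants and can mask reference bias and errors on rare alleles, which the abstract examines separately.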

Place, publisher, year, edition, pages
Oxford University Press, 2022
National Category
Computational Mathematics; Genetics and Genomics
Research subject
Scientific Computing
Identifiers
urn:nbn:se:uu:diva-396336 (URN)
10.1093/g3journal/jkac089 (DOI)
000791204600001 ()
35482488 (PubMedID)
Projects
eSSENCE
Funder
Swedish Research Council Formas, 2020-00712
Available from: 2019-11-04. Created: 2019-11-04. Last updated: 2025-02-01. Bibliographically approved.
Clouard, C. & Nettelblad, C. (2022). Consistency Study of a Reconstructed Genotype Probability Distribution via Clustered Bootstrapping in NORB Pooling Blocks. Uppsala: Uppsala University
Consistency Study of a Reconstructed Genotype Probability Distribution via Clustered Bootstrapping in NORB Pooling Blocks
2022 (English). Report (Other academic)
Abstract [en]

For applications with biallelic genetic markers, group testing techniques, synonymous with pooling techniques, are usually applied for decreasing the cost of large-scale testing, e.g. when detecting carriers of rare genetic variants. In some configurations, the results of the grouped tests cannot be decoded and the pooled items are missing. Inference of these missing items can be performed with specific statistical methods, related for example to the Expectation-Maximization algorithm. Pooling has also been applied for determining the genotype of markers in large populations. The particularity of full genotype data for diploid organisms in the context of group testing is the ternary outcome (two homozygous genotypes and one heterozygous), as well as the distribution of these three outcomes in a population, which is often governed by the Hardy-Weinberg Equilibrium and then depends on the allele frequency. When using a nonoverlapping repeated block pooling design, the missing items are only observed in particular arrangements. Overall, a data set of pooled genotypes can be described as an inference problem in Missing Not At Random data with nonmonotone missingness patterns. This study presents a preliminary investigation of the consistency of various iterative methods estimating the most likely genotype probabilities of the missing items in pooled data. We use the Kullback-Leibler divergence and the L2 distance between the genotype distribution computed from our estimates and a simulated empirical distribution as a measure of the distributional consistency.
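The Hardy-Weinberg genotype distribution invoked above follows directly from the alternate-allele frequency. The textbook formula, as a one-line sketch (standard population genetics, not code from the report):

```python
def hwe_genotype_probs(q):
    """Genotype probabilities (homozygous reference, heterozygous, homozygous
    alternate) for alternate-allele frequency q under Hardy-Weinberg Equilibrium."""
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)
```

For a rare variant (small q), the heterozygous and homozygous-alternate classes are correspondingly rare, which is what makes the ternary-outcome distribution strongly frequency-dependent in the inference problem described above.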

Place, publisher, year, edition, pages
Uppsala: Uppsala University, 2022. p. 13
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2022-005
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:uu:diva-487718 (URN)
Available from: 2022-10-31. Created: 2022-10-31. Last updated: 2024-01-10. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0458-6902