Uppsala University Publications
Developer Friendly and Computationally Efficient Predictive Modeling without Information Leakage: The emil Package for R
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Medical Sciences, Cancer Pharmacology and Computational Medicine. ORCID iD: 0000-0002-9615-5079
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Medical Sciences, Cancer Pharmacology and Computational Medicine.
(English) In: Journal of Statistical Software, ISSN 1548-7660. Article in journal (Other academic). Submitted
Abstract [en]

Machine learning-based solutions to predictive modeling problems (classification, regression, or survival analysis) typically involve a number of steps, beginning with data pre-processing and ending with performance evaluation. Many R packages provide tools for the individual steps, but not for assembling them into complete modeling procedures or for rigorously evaluating their combined performance.

We present a new R package called emil (evaluation of modeling without information leakage), designed to be a flexible backbone for modeling procedures with the following properties:
(1) It enables evaluation of performance and variable importance by means of resampling methods without introducing information leakage.
(2) It returns parameter tuning statistics and final prediction models.
(3) It has a transparent, highly customizable, and easy-to-debug structure.
(4) It offers the user direct control over memory- and CPU-intensive steps of the calculations.
(5) It has comprehensive yet concise documentation.
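
Property (1) can be illustrated with a small, generic sketch. It is written in Python using only the standard library and is not emil's API or the authors' implementation; every function name below is hypothetical. The point it shows is that pre-processing parameters (here, feature means and standard deviations) are estimated from the training observations of each fold only, so nothing about the held-out observations leaks into the fitted model.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_scaler(rows):
    """Estimate per-feature mean and standard deviation from the given rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return means, sds


def apply_scaler(rows, scaler):
    """Standardize rows using previously estimated means and deviations."""
    means, sds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier (illustrative choice): one mean vector per class."""
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [statistics.fmean(c) for c in zip(*members)]
    return centroids


def predict(centroids, rows):
    """Assign each row to the class with the closest centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(r, centroids[lab])) for r in rows]


def cross_validate(x, y, k=5, seed=0):
    """Return per-fold accuracy; assumes len(x) >= k so no test fold is empty."""
    accuracies = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        x_train = [x[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        x_test = [x[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        # Pre-processing is fitted on the training observations only.
        scaler = fit_scaler(x_train)
        model = fit_centroids(apply_scaler(x_train, scaler), y_train)
        # The held-out fold is transformed with the training-fold parameters.
        preds = predict(model, apply_scaler(x_test, scaler))
        accuracies.append(sum(p == t for p, t in zip(preds, y_test)) / len(y_test))
    return accuracies
```

Calling `cross_validate(x, y, k=5)` with a list of feature rows `x` and a list of class labels `y` returns one accuracy per fold. Fitting the scaler on the full data set before splitting would be the leaky variant that this structure avoids.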

First, we explain emil's functionality in the context of standard usage, resampling, and customization. Specific application examples then show its potential in terms of parallelization, customization for survival analysis, and memory management.

The result is a computationally efficient and developer-friendly framework that enables resampling-based analyses using several hundred thousand variables, is easy to extend, and allows development of scalable solutions.

Keyword [en]
predictive modeling, machine learning, performance evaluation, resampling, high performance computing
National Category
Computational Mathematics
Research subject
Materials Science
URN: urn:nbn:se:uu:diva-242353
OAI: oai:DiVA.org:uu-242353
DiVA: diva2:783296
Swedish Foundation for Strategic Research, RBc08-008
Available from: 2015-01-25 Created: 2015-01-25 Last updated: 2015-03-11 Bibliographically approved
In thesis
1. Machine Learning Based Analysis of DNA Methylation Patterns in Pediatric Acute Leukemia
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Maskininlärningsbaserad analys av DNA-metyleringsmönster i pediatrisk akut lymfatisk leukemi
Abstract [en]

Acute lymphoblastic leukemia (ALL) is the most common pediatric cancer in the Nordic countries. Recent evidence indicates that DNA methylation (DNAm) plays a central role in the development and progression of the disease.

DNAm profiles of a collection of ALL patient samples and a panel of non-leukemic reference samples were analyzed using the Infinium 450k methylation assay. State-of-the-art machine learning algorithms were used to search the large amounts of data produced for patterns predictive of future relapses, in vitro drug resistance, and cytogenetic subtypes, aiming at improving our understanding of the disease and ultimately improving treatment.

In paper I, the predictive modeling framework developed to perform the analyses of the DNAm dataset was presented. It focused on uncompromising statistical rigor and computational efficiency, while allowing a high level of modeling flexibility and usability. In paper II, the DNAm landscape of ALL was comprehensively characterized, discovering widespread aberrant methylation at diagnosis strongly influenced by cytogenetic subtype. The aberrantly methylated regions were enriched for genes repressed by polycomb group proteins, repressively marked histones in healthy cells, and genes associated with embryonic development. A consistent trend of hypermethylation at relapse was also discovered. In paper III, a tool for DNAm-based subtyping was presented, validated using blinded samples, and used to re-classify samples with incomplete phenotypic information. Using RNA-sequencing, previously undetected non-canonical aberrations were found in many re-classified samples. In paper IV, the relationship between DNAm and in vitro drug resistance was investigated, and predictive signatures were obtained for seven of the eight therapeutic drugs studied. Interpretation was challenging due to poor correlation between DNAm and gene expression, further complicated by the discovery that random subsets of the array can yield comparable classification accuracy. Paper V presents a novel Bayesian method for multivariate density estimation with variable bandwidths. Simulations showed comparable performance to the current state-of-the-art methods and an advantage on skewed distributions.

In conclusion, the studies characterize the information contained in the aberrant DNAm patterns of ALL and assess its predictive capabilities for future relapses, in vitro drug sensitivity and subtyping. They also present three publicly available tools for the scientific community to use.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2015. 68 p.
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine, ISSN 1651-6206 ; 1069
National Category
Bioinformatics (Computational Biology) Hematology Cancer and Oncology
urn:nbn:se:uu:diva-242544 (URN)
978-91-554-9151-2 (ISBN)
Public defence
2015-03-13, Auditorium minus, Museum Gustavianum, Akademigatan 3, Uppsala, 14:00 (English)
Swedish Foundation for Strategic Research, RBc08-008
Available from: 2015-02-19 Created: 2015-01-27 Last updated: 2015-03-27 Bibliographically approved

Author/editor: Bäcklin, Christofer; Gustafsson, Mats