The C1C2: a framework for simultaneous model selection and assessment
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences, Pharmaceutical Pharmacology. (Proteochemometric group)
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences, Pharmaceutical Pharmacology. (Proteochemometric group) ORCID iD: 0000-0002-8083-2864
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences, Pharmaceutical Pharmacology. (Proteochemometric group)
2008 (English) In: BMC Bioinformatics, ISSN 1471-2105, Vol. 9, 360- p. Article in journal (Refereed) Published
Abstract [en]

BACKGROUND: There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we have addressed this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework relies on a partitioning of the data so that model choice and model assessment use separate data. Since the number of conceivable models is in general vast, it was also of interest to investigate two automatic search methods for model choice: a genetic algorithm and a brute-force method. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.

RESULTS: The C1C2 framework performed well at finding the true model in terms of choosing the correct variable subset and producing reasonable choices for the penalizing parameter, even in situations when the independent variables were highly correlated and when the number of observations was less than the number of variables. The C1C2 framework was also found to give accurate estimates of the generalization error. Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of generalization error estimates. Using the genetic algorithm worsened the model choice but not the generalization error estimates, compared to using the brute-force method. The results obtained with repeated K-fold cross-validation were similar to those produced by the C1C2 in terms of model choice; however, the generalization error estimates were less accurate.

CONCLUSION: The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and model assumption deviations. A complete separation of the model choice and the model assessment in terms of data used for each task improves the estimates of the generalization error.
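The idea of separating the data used for model choice from the data used for model assessment can be illustrated with a minimal sketch. This is not the C1C2 procedure itself, whose details are not given in this record; it is a generic illustration that assumes a ridge-penalized linear model without intercept, a brute-force search over small variable subsets, K-fold cross-validation for model choice, and a single held-out partition for assessment. All function names and parameters (`ridge_fit`, `cv_error`, `choose_model`, `select_and_assess`, `max_vars`, `holdout`) are hypothetical.

```python
import itertools
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression coefficients (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k, rng):
    """Mean squared prediction error estimated by K-fold cross-validation."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))

def choose_model(X, y, lambdas, max_vars, k, rng):
    """Brute-force search over variable subsets and penalty values."""
    best = (np.inf, None, None)
    for size in range(1, max_vars + 1):
        for subset in itertools.combinations(range(X.shape[1]), size):
            cols = list(subset)
            for lam in lambdas:
                err = cv_error(X[:, cols], y, lam, k, rng)
                if err < best[0]:
                    best = (err, subset, lam)
    return best[1], best[2]

def select_and_assess(X, y, lambdas, max_vars=2, k=5, holdout=0.3, seed=0):
    """Choose a model on one partition, then assess it on untouched data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_hold = int(round(holdout * len(y)))
    assess_idx, choice_idx = idx[:n_hold], idx[n_hold:]

    # Model choice only ever sees the "choice" partition.
    subset, lam = choose_model(X[choice_idx], y[choice_idx],
                               lambdas, max_vars, k, rng)

    # Refit the chosen model on the choice partition and estimate its
    # generalization error on the held-out assessment partition.
    cols = list(subset)
    beta = ridge_fit(X[choice_idx][:, cols], y[choice_idx], lam)
    resid = y[assess_idx] - X[assess_idx][:, cols] @ beta
    return subset, lam, float(np.mean(resid ** 2))
```

Because the assessment partition plays no part in choosing the variable subset or the penalty, the returned error is not biased by the search itself. A genetic algorithm could replace the exhaustive loop in `choose_model` when the number of candidate subsets is too large to enumerate.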

Place, publisher, year, edition, pages
2008. Vol. 9, 360- p.
National Category
Pharmaceutical Sciences
Identifiers
URN: urn:nbn:se:uu:diva-104211
DOI: 10.1186/1471-2105-9-360
ISI: 000259742800001
PubMedID: 18761753
OAI: oai:DiVA.org:uu-104211
DiVA: diva2:219542
Available from: 2009-05-27 Created: 2009-05-27 Last updated: 2015-05-04 Bibliographically approved
In thesis
1. eScience Approaches to Model Selection and Assessment: Applications in Bioinformatics
2009 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

High-throughput experimental methods, such as DNA and protein microarrays, have become ubiquitous and indispensable tools in biology and biomedicine, and the number of high-throughput technologies is constantly increasing. They provide the power to measure thousands of properties of a biological system in a single experiment and have the potential to revolutionize our understanding of biology and medicine. However, the high expectations of high-throughput methods are challenged by the problem of statistically modeling the wealth of data in order to translate it into concrete biological knowledge, new drugs, and clinical practices. In particular, the huge number of properties measured in high-throughput experiments makes statistical model selection and assessment demanding. To use high-throughput data in critical applications, it must be warranted that the models we construct reflect the underlying biology and are not just hypotheses suggested by the data. We must furthermore have a clear picture of the risk of making incorrect decisions based on the models.

The rapid improvements in computers and information technology have opened up new ways of approaching the problem of model selection and assessment. Specifically, eScience, i.e. computationally intensive science that is carried out in distributed network environments, provides computational power and means to efficiently access previously acquired scientific knowledge. This thesis investigates how we can use eScience to improve our chances of constructing biologically relevant models from high-throughput data. Novel methods for model selection and assessment are proposed that leverage computational power and prior scientific information to "guide" the model selection towards models that are a priori likely to be relevant. In addition, a software system for deploying new methods and making them easily accessible to end users is presented.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2009. 51 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Pharmacy, ISSN 1651-6192 ; 112
Keyword
bioinformatics, high-throughput biology, eScience, model selection, model assessment
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:uu:diva-109437 (URN), 978-91-554-7634-2 (ISBN)
Public defence
2009-11-28, B42, BMC, Husargatan 3, Uppsala, 10:15 (English)
Available from: 2009-11-06 Created: 2009-10-15 Last updated: 2011-05-11 Bibliographically approved
Authority records BETA

Spjuth, Ola