Uppsala University Publications (uu.se)
eScience Approaches to Model Selection and Assessment: Applications in Bioinformatics
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences. (Jarl Wikberg)
2009 (English). Doctoral thesis, with articles (Other academic)
Abstract [en]

High-throughput experimental methods, such as DNA and protein microarrays, have become ubiquitous and indispensable tools in biology and biomedicine, and the number of high-throughput technologies is constantly increasing. They provide the power to measure thousands of properties of a biological system in a single experiment and have the potential to revolutionize our understanding of biology and medicine. However, the high expectations of high-throughput methods are challenged by the problem of statistically modeling the wealth of data in order to translate it into concrete biological knowledge, new drugs, and clinical practices. In particular, the huge number of properties measured in high-throughput experiments makes statistical model selection and assessment exigent. To use high-throughput data in critical applications, it must be warranted that the models we construct reflect the underlying biology and are not just hypotheses suggested by the data. We must furthermore have a clear picture of the risk of making incorrect decisions based on the models.

Rapid improvements in computers and information technology have opened up new ways to approach the problem of model selection and assessment. Specifically, eScience, i.e. computationally intensive science carried out in distributed network environments, provides computational power and the means to efficiently access previously acquired scientific knowledge. This thesis investigates how we can use eScience to improve our chances of constructing biologically relevant models from high-throughput data. Novel methods for model selection and assessment are proposed that leverage computational power and prior scientific information to "guide" the model selection toward models that are a priori likely to be relevant. In addition, a software system for deploying new methods and making them easily accessible to end users is presented.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2009, p. 51
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Pharmacy, ISSN 1651-6192 ; 112
Keywords [en]
bioinformatics, high-throughput biology, eScience, model selection, model assessment
Identifiers
URN: urn:nbn:se:uu:diva-109437; ISBN: 978-91-554-7634-2 (print); OAI: oai:DiVA.org:uu-109437; DiVA, id: diva2:272464
Public defence
2009-11-28, B42, BMC, Husargatan 3, Uppsala, 10:15 (English)
Available from: 2009-11-06; Created: 2009-10-15; Last updated: 2011-05-11; Bibliographically approved
List of papers
1. The C1C2: a framework for simultaneous model selection and assessment
2008 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 9, p. 360. Article in journal (Refereed). Published
Abstract [en]

BACKGROUND: There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we have addressed this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework relies on a partitioning of the data in order to separate model choice from model assessment in terms of the data used. Since the number of conceivable models in general is vast, it was also of interest to investigate the employment of two automatic search methods, a genetic algorithm and a brute-force method, for model choice. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.

RESULTS: The C1C2 framework performed well at finding the true model in terms of choosing the correct variable subset and producing reasonable choices for the penalizing parameter, even in situations when the independent variables were highly correlated and when the number of observations was less than the number of variables. The C1C2 framework was also found to give accurate estimates of the generalization error. Prior information about the number of important independent variables improved the variable subset choice but reduced the accuracy of generalization error estimates. Using the genetic algorithm worsened the model choice but not the generalization error estimates, compared to using the brute-force method. The results obtained with repeated K-fold cross-validation were similar to those produced by the C1C2 in terms of model choice; however, a lower accuracy of the generalization error estimates was observed.

CONCLUSION: The C1C2 framework was demonstrated to work well for finding the true model within a penalized linear model class and accurately assessing its generalization error, even for datasets with many highly correlated independent variables, a low observation-to-variable ratio, and model assumption deviations. A complete separation of the model choice and the model assessment in terms of data used for each task improves the estimates of the generalization error.
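
The general idea of keeping model choice and model assessment on separate data can be illustrated with a minimal sketch. This is not the C1C2 procedure itself, whose partitioning scheme is not spelled out in the abstract. It assumes a simple three-way split, a brute-force search over hypothetical candidates (one variable plus one penalty value), and a one-predictor ridge model; all function names are illustrative.

```python
import random


def fit_ridge_1d(xs, ys, penalty):
    """Closed-form ridge slope for one predictor and no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mse(xs, ys, slope):
    """Mean squared prediction error of the one-predictor model."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(ys)


def choose_and_assess(rows, ys, penalties, seed=0):
    """Choose (variable, penalty) on one data part, assess on another."""
    rng = random.Random(seed)
    idx = list(range(len(ys)))
    rng.shuffle(idx)
    n = len(idx)
    fit_idx = idx[: n // 2]                # used to estimate the slope
    choice_idx = idx[n // 2: 3 * n // 4]   # used only to choose the model
    assess_idx = idx[3 * n // 4:]          # used only to assess the chosen model

    def column(j, subset):
        return [rows[i][j] for i in subset]

    def target(subset):
        return [ys[i] for i in subset]

    # Brute-force search over all (variable, penalty) candidates.
    best = None
    for j in range(len(rows[0])):
        for penalty in penalties:
            slope = fit_ridge_1d(column(j, fit_idx), target(fit_idx), penalty)
            err = mse(column(j, choice_idx), target(choice_idx), slope)
            if best is None or err < best[0]:
                best = (err, j, penalty, slope)

    choice_err, j, penalty, slope = best
    # The assessment part played no role in the choice above.
    assess_err = mse(column(j, assess_idx), target(assess_idx), slope)
    return {"variable": j, "penalty": penalty,
            "choice_error": choice_err, "assessment_error": assess_err}


def simulate_selection_data(n=200, p=5, seed=1):
    """Simulated data in which only variable 2 drives the response."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    ys = [2.0 * r[2] + rng.gauss(0.0, 0.5) for r in rows]
    return rows, ys


rows, ys = simulate_selection_data()
print(choose_and_assess(rows, ys, penalties=[0.0, 1.0, 10.0]))
```

The error on the choice part is minimized over many candidates and therefore tends to be optimistic; the error on the assessment part is not affected by that search, which is the motivation for separating the two.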

Identifiers
urn:nbn:se:uu:diva-104211 (URN); 10.1186/1471-2105-9-360 (DOI); 000259742800001 (ISI); 18761753 (PubMedID)
Available from: 2009-05-27; Created: 2009-05-27; Last updated: 2018-01-13; Bibliographically approved
2. An eScience-Bayes strategy for analyzing omics data
2010 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 11, p. 282. Article in journal (Refereed). Published
Abstract [en]

Background: The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge of analyzing the huge datasets produced. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in ad hoc approaches to address specific problems.

Results: We present a general approach to omics data analysis that alleviates these problems. By combining eScience and Bayesian methods, we retrieve scientific information and data from multiple sources and coherently incorporate them into large models. These models improve the accuracy of predictions and offer new insights into the underlying mechanisms. This "eScience-Bayes" approach is demonstrated in two proof-of-principle applications, one for breast cancer prognosis prediction from transcriptomic data and one for protein-protein interaction studies based on proteomic data.

Conclusions: Bayesian statistics provide the flexibility to tailor statistical models to the complex data structures in omics biology and permit coherent integration of multiple data sources. However, Bayesian methods are in general computationally demanding and require specification of possibly thousands of prior distributions. eScience can help us overcome these difficulties. The eScience-Bayes approach thus permits us to fully leverage the advantages of Bayesian methods, resulting in models with improved predictive performance that give more information about the underlying biological system.
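
How a prior distribution carries previously acquired knowledge into an estimate can be shown with a very small worked example. This is not the model used in the paper; it is a textbook conjugate update for a normal mean with known noise variance, and the numbers and the function name are assumptions chosen for illustration.

```python
def normal_posterior(prior_mean, prior_var, observations, noise_var):
    """Posterior mean and variance of a normal mean with known noise variance."""
    prior_precision = 1.0 / prior_var
    data_precision = len(observations) / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted combination of the prior mean and the data.
    post_mean = post_var * (prior_mean * prior_precision
                            + sum(observations) / noise_var)
    return post_mean, post_var


# Three noisy measurements of one quantity (hypothetical values).
observations = [1.2, 0.8, 1.0]

# A vague prior lets the data dominate; an informative prior, standing in for
# knowledge retrieved from external sources, tightens the estimate.
vague = normal_posterior(0.0, 100.0, observations, noise_var=1.0)
informed = normal_posterior(1.0, 0.25, observations, noise_var=1.0)
print("vague prior:   ", vague)
print("informed prior:", informed)
```

With few observations per parameter, which is the typical omics setting described above, the prior contributes a substantial share of the total precision; this is also why specifying thousands of such priors by hand is impractical.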

Place, publisher, year, edition, pages
BioMed Central, 2010
Identifiers
urn:nbn:se:uu:diva-109359 (URN); 10.1186/1471-2105-11-282 (DOI); 000279732900004 (ISI); 20504364 (PubMedID)
Available from: 2009-10-14; Created: 2009-10-14; Last updated: 2018-01-12; Bibliographically approved
3. SimSel: a new simulation method for variable selection
2012 (English). In: Journal of Statistical Computation and Simulation, ISSN 0094-9655, E-ISSN 1563-5163, Vol. 82, no. 4, p. 515-527. Article in journal (Refereed). Published
Abstract [en]

We propose a new simulation method, SimSel, for variable selection in linear and nonlinear modelling problems. SimSel works by disturbing the input data with pseudo-errors. We then study how this disturbance affects the quality of an approximative model fitted to the data. The main idea is that disturbing unimportant variables does not affect the quality of the model fit. The use of an approximative model has the advantage that the true underlying function does not need to be known and that the method becomes insensitive to model misspecifications. We demonstrate SimSel on simulated data from linear and nonlinear models and on two real data sets. The simulation studies suggest that SimSel works well in complicated situations, such as nonlinear errors-in-variables models.
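
The core idea, that adding pseudo-errors to an unimportant variable leaves the quality of the fit essentially unchanged, can be sketched as follows. This is only an illustration under stated assumptions, not the SimSel procedure: the approximative model is taken to be an ordinary least-squares fit without intercept, fit quality is measured by the residual sum of squares, and no formal selection rule or pseudo-variable is included. All function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss_of_linear_fit(rows, ys):
    """Residual sum of squares of a least-squares fit without intercept."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, y in zip(rows, ys))


def disturbance_effect(rows, ys, j, sd=1.0, repeats=20, seed=0):
    """Average RSS after adding pseudo-errors to variable j, relative to the
    RSS of the undisturbed fit. Values near 1 suggest an unimportant variable."""
    rng = random.Random(seed)
    base = rss_of_linear_fit(rows, ys)
    total = 0.0
    for _ in range(repeats):
        noisy = [r[:j] + [r[j] + rng.gauss(0.0, sd)] + r[j + 1:] for r in rows]
        total += rss_of_linear_fit(noisy, ys)
    return total / repeats / base


def simulate_linear_data(n=100, seed=1):
    """Simulated data: variables 0 and 1 matter, variables 2 and 3 do not."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)]
    ys = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0.0, 0.5) for r in rows]
    return rows, ys


rows, ys = simulate_linear_data()
for j in range(4):
    print(j, round(disturbance_effect(rows, ys, j), 2))
```

In this simulation the ratio is expected to rise well above 1 for variables 0 and 1 and to stay close to 1 for variables 2 and 3.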

Keywords
variable selection, simulation method, pseudo-error, pseudo-variable
Identifiers
urn:nbn:se:uu:diva-109360 (URN); 10.1080/00949655.2010.543981 (DOI); 000303234800003 (ISI)
Available from: 2009-10-14; Created: 2009-10-14; Last updated: 2018-01-12; Bibliographically approved
4. Ridge-SimSel: A generalization of the variable selection method SimSel to multicollinear data sets
(English). Article in journal (Refereed). Submitted
Identifiers
urn:nbn:se:uu:diva-109361 (URN)
Available from: 2009-10-14; Created: 2009-10-14; Last updated: 2012-07-26; Bibliographically approved
5. Bioclipse: an open source workbench for chemo- and bioinformatics
2007 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 8, p. 59. Article in journal (Refereed). Published
Abstract [en]

BACKGROUND: There is a need for software applications that provide users with a complete and extensible toolkit for chemo- and bioinformatics accessible from a single workbench. Commercial packages are expensive and closed source, hence they do not allow end users to modify algorithms and add custom functionality. Existing open source projects are more focused on providing a framework for integrating existing, separately installed bioinformatics packages, rather than providing user-friendly interfaces. No open source chemoinformatics workbench has previously been published, and no successful attempts have been made to integrate chemo- and bioinformatics into a single framework.

RESULTS: Bioclipse is an advanced workbench for resources in chemo- and bioinformatics, such as molecules, proteins, sequences, spectra, and scripts. It provides 2D-editing, 3D-visualization, file format conversion, calculation of chemical properties, and much more; all fully integrated into a user-friendly desktop application. Editing supports standard functions such as cut and paste, drag and drop, and undo/redo. Bioclipse is written in Java and based on the Eclipse Rich Client Platform with a state-of-the-art plugin architecture. This gives Bioclipse an advantage over other systems as it can easily be extended with functionality in any desired direction.

CONCLUSION: Bioclipse is a powerful workbench for bio- and chemoinformatics as well as an advanced integration platform. The rich functionality, intuitive user interface, and powerful plugin architecture make Bioclipse the most advanced and user-friendly open source workbench for chemo- and bioinformatics. Bioclipse is released under the Eclipse Public License (EPL), an open source license that sets no constraints on external plugin licensing; it is open to both open source and commercial plugins. Bioclipse is freely available at http://www.bioclipse.net.

Identifiers
urn:nbn:se:uu:diva-104257 (URN); 10.1186/1471-2105-8-59 (DOI); 000244600100001 (ISI); 17316423 (PubMedID)
Available from: 2009-05-28; Created: 2009-05-28; Last updated: 2018-01-13; Bibliographically approved

Open Access in DiVA

Full text (1890 kB), 395 downloads
File information
File: FULLTEXT01.pdf; File size: 1890 kB; Checksum: SHA-512
a310f88deb5768adf6e7fbeff9da9a8cb444236be34e7ddf03b9a6beb1dacf3dcd9d8e45734c1d26e2351a305f0d86d8795c7a7cd1e45913c98d4da6196ab271
Type: fulltext; Mimetype: application/pdf

Author/editor
Eklund, Martin