Monte Carlo feature selection and interdependency discovery in supervised classification
Institute of Computer Science, Polish Academy of Sciences.
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Cell and Molecular Biology, The Linnaeus Centre for Bioinformatics. (Jan Komorowski's)
Institute of Computer Science, Polish Academy of Sciences.
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Cell and Molecular Biology, The Linnaeus Centre for Bioinformatics.
2010 (English)
In: Advances in Machine Learning: Dedicated to the memory of Professor Ryszard S. Michalski, Heidelberg: Springer, 2010
Chapter in book (Other academic)
Abstract [en]

Applications in the life sciences are the main force behind a paradigm shift in the way machine learning techniques are used. Rather than obtaining the best possible supervised classifier, the life scientist needs to know which features contribute most to distinguishing the classes and what the interdependencies between the features are. To this end we significantly extend our earlier work [Dramiński et al. (2008)], which introduced an effective and reliable method for ranking features according to their importance for classification. We begin by adding a method for finding a cut-off between informative and non-informative features, and then develop a methodology and an implementation of a procedure for determining interdependencies between informative features. The reliability of our approach rests on the repeated construction of tree classifiers. Essentially, each classifier is trained on a randomly chosen subset of the original data using only a fraction of all of the observed features. This approach is conceptually simple yet computer-intensive. The methodology is validated on a large and difficult task, modelling HIV-1 reverse transcriptase resistance to drugs, which is a good example of the aforementioned paradigm shift. We construct a classifier, but the main interest lies in identifying the mutation points (i.e. features) and their combinations that model drug resistance.
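The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the MCFS-ID implementation: one-split decision stumps stand in for full tree classifiers, the score (held-out accuracy credited to the feature the stump splits on) is a simplification chosen for illustration, and the cut-off and interdependency steps are not covered. The function names, parameters and toy data are all assumptions.

```python
import random
from collections import Counter, defaultdict


def fit_stump(rows, labels, feature):
    """Map each value of one feature to the majority class seen in training."""
    by_value = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_value[row[feature]][label] += 1
    default = Counter(labels).most_common(1)[0][0]
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return rule, default


def stump_accuracy(rule, default, rows, labels, feature):
    """Fraction of rows whose label the stump predicts correctly."""
    hits = sum(rule.get(row[feature], default) == label
               for row, label in zip(rows, labels))
    return hits / len(labels)


def monte_carlo_ranking(rows, labels, n_subsets=200, subset_size=3,
                        train_fraction=0.66, seed=0):
    """Rank features by accumulated held-out accuracy over random resamplings."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    indices = list(range(len(rows)))
    scores = [0.0] * n_features
    for _ in range(n_subsets):
        # Draw a random fraction of the features and a random train/test split.
        features = rng.sample(range(n_features), subset_size)
        rng.shuffle(indices)
        cut = int(train_fraction * len(indices))
        train, test = indices[:cut], indices[cut:]
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        te_rows = [rows[i] for i in test]
        te_labels = [labels[i] for i in test]

        # Simplified "tree": the single split that fits the training data best.
        def train_fit(f):
            rule, default = fit_stump(tr_rows, tr_labels, f)
            return stump_accuracy(rule, default, tr_rows, tr_labels, f)

        best = max(features, key=train_fit)
        rule, default = fit_stump(tr_rows, tr_labels, best)
        # Credit the chosen feature with the stump's accuracy on held-out data.
        scores[best] += stump_accuracy(rule, default, te_rows, te_labels, best)
    ranking = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return ranking, scores


# Toy data (hypothetical): six binary features, the class equals feature 2.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 1) for _ in range(6)] for _ in range(60)]
labels = [row[2] for row in rows]
ranking, scores = monte_carlo_ranking(rows, labels)
print("features ranked by score:", ranking)
```

On this toy data the informative feature should come out on top, because it is credited with near-perfect held-out accuracy whenever it is sampled, while the noise features only collect chance-level scores.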

Place, publisher, year, edition, pages
Heidelberg: Springer, 2010.
Series
Studies in Computational Intelligence, ISSN 1860-949X ; 263
National Category
Computer Sciences; Microbiology in the medical area
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-109834
ISBN: 978-3-642-05178-4 (print)
OAI: oai:DiVA.org:uu-109834
DiVA, id: diva2:274118
Projects
feature selection, interdependency discovery, MCFS-ID, biological sequence analysis
Available from: 2009-11-05 Created: 2009-10-27 Last updated: 2018-01-12 Bibliographically approved
In thesis
1. From Physicochemical Features to Interdependency Networks: A Monte Carlo Approach to Modeling HIV-1 Resistome and Post-translational Modifications
2009 (English)
Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

New technologies have supplied life scientists with large amounts of experimental data. The data sets are large not only in terms of the number of observations, but also in terms of the number of recorded features. One of the aims of modeling is to explain a given phenomenon in the simplest possible way, hence the need to select suitable features.

We extended a Monte Carlo-based approach to selecting statistically significant features with discovery of feature interdependencies and used it in modeling sequence-function relationships in proteins. Our approach led to compact and easy-to-interpret predictive models.

First, we represented protein sequences in terms of their physicochemical properties. This was followed by our feature selection and discovery of feature interdependencies. Finally, predictive models based on e.g., decision trees or rough sets were constructed.
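As a minimal sketch of the first step, a sequence can be turned into numeric features by looking up one physicochemical value per residue. The abstract does not say which properties were used; the Kyte-Doolittle hydropathy scale below is an assumption chosen for illustration, and `encode` is a hypothetical helper name.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Replace each residue letter with its value on the given scale."""
    return [scale[residue] for residue in sequence.upper()]


# Each position of the sequence becomes one numeric feature.
features = encode("MKV")
print(features)
```

Encoding with several scales at once would give several features per position, which is the kind of table a feature selection step can then work on.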

We applied the method to model two important biological problems: 1) HIV-1 resistance to reverse transcriptase-targeted drugs and 2) post-translational modifications of proteins.

In the case of HIV resistance, we were not only able to predict whether the mutated protein is resistant to a drug or not, but we also suggested some new, previously neglected, mutations that possibly contribute to drug resistance. For all these mutations we proposed probable molecular mechanisms of action using literature and 3D structure studies.

In the case of predicting PTMs, we built high-accuracy models of modifications. Unlike other methods, ours allowed us to resolve whether the closest neighborhood of a residue (the nanomer) is sufficient to determine its modification status. Importantly, applying our method yields networks of interdependent physicochemical properties of amino acids that show how these properties collaborate in establishing a given modification.

We believe that the presented methods will help researchers to analyze a large class of important biological problems and will guide them in their research.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2009. p. 89
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 688
Keywords
bioinformatics, HIV-1, resistome analysis, drug resistance, predicting PTMs, molecular interdependency networks, MCFS-ID, feature selection, interactome, machine-learning, rough sets
National Category
Bioinformatics and Systems Biology
Research subject
Computer Science; Biology, with specialization in structural biology
Identifiers
urn:nbn:se:uu:diva-109873 (URN)
978-91-554-7650-2 (ISBN)
Public defence
2009-12-15, C8:305, BMC, Husargatan 3, Uppsala, 09:15 (English)
Available from: 2009-11-18 Created: 2009-10-28 Last updated: 2009-11-18 Bibliographically approved

Open Access in DiVA

fulltext (996 kB), 1091 downloads
File information
File name: FULLTEXT01.pdf
File size: 996 kB
Checksum (SHA-512): e44ce28b288f796a7067331f20f6c2dd6920e99e4e53c3bf1200b093b7c25fd57b2b7b2360387afd3a39bd61bd756639d0799987ee8043d42075e8fc355fde42
Type: fulltext
Mimetype: application/pdf

Other links

http://www.springer.com/engineering/book/978-3-642-05178-4

Authority records BETA

Kierczak, Marcin
