Random Reducts: A Monte Carlo Rough Set-based Method for Feature Selection in Large Datasets
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Cell and Molecular Biology, Computational and Systems Biology.
2013 (English) In: Fundamenta Informaticae, ISSN 0169-2968, Vol. 127, no 1-4, 273-288 p. Article in journal (Refereed), Published
Abstract [en]

An important step prior to constructing a classifier for a very large data set is feature selection. For many problems it is possible to find a subset of attributes that has the same discriminative power as the full attribute set. There are many feature selection methods, but none of them ties Rough Set models to statistical argumentation. Moreover, known feature selection methods usually discard shadowed features, i.e. those carrying the same, or partially the same, information as the selected features. In this study we present Random Reducts (RR), a feature selection method that precedes classification per se. The method is based on the Monte Carlo Feature Selection (MCFS) layout and uses Rough Set Theory in the feature selection process. On synthetic data, we demonstrate that the method is able to select otherwise shadowed features, of which the user should be made aware, and to find interactions in the data set.
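The general idea in the abstract can be illustrated with a small sketch. This is not the authors' Random Reducts algorithm, which the abstract does not specify; it is a minimal illustration assuming that (a) random attribute subsets are drawn repeatedly, in the spirit of a Monte Carlo layout, (b) one rough-set reduct is computed per subset by backward elimination while preserving the positive region, and (c) attributes are ranked by how often they appear in a reduct. All function names, parameters and the toy data are hypothetical.

```python
import random
from collections import Counter

def positive_region_size(rows, labels, attrs):
    """Count objects whose indiscernibility class (w.r.t. attrs) has a single label."""
    classes = {}
    for row, label in zip(rows, labels):
        classes.setdefault(tuple(row[a] for a in attrs), []).append(label)
    return sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)

def one_reduct(rows, labels, attrs, rng):
    """Backward elimination in random order: drop an attribute whenever the
    positive region of the remaining attributes stays the same."""
    target = positive_region_size(rows, labels, attrs)
    reduct = list(attrs)
    order = list(attrs)
    rng.shuffle(order)
    for a in order:
        trial = [x for x in reduct if x != a]
        if positive_region_size(rows, labels, trial) == target:
            reduct = trial
    return reduct

def random_reduct_counts(rows, labels, n_iter=200, subset_size=3, seed=0):
    """Hypothetical ranking: how often each attribute occurs in a reduct
    of a randomly drawn attribute subset."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    counts = Counter()
    for _ in range(n_iter):
        subset = rng.sample(range(n_attrs), subset_size)
        counts.update(one_reduct(rows, labels, subset, rng))
    return counts

# Toy data (an assumption for illustration): the label is a0 XOR a1,
# a2 is an exact copy of a0 (a "shadowed" feature), a3 is noise.
rows = [(a, b, a, n) for a in (0, 1) for b in (0, 1) for n in (0, 1)]
labels = [a ^ b for a, b, _, _ in rows]

counts = random_reduct_counts(rows, labels)
for attr in range(4):
    print(f"attribute {attr}: in {counts[attr]} reducts")
```

In this toy setting a single reduct of the full table would contain either attribute 0 or its copy, attribute 2, but not both. Because the subsets are random, both attributes accumulate counts, while the noise attribute never appears in a reduct. Attribute 1 only ever appears together with 0 or 2, which reflects the interaction behind the XOR label.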

Place, publisher, year, edition, pages
2013. Vol. 127, no 1-4, 273-288 p.
National Category
Bioinformatics (Computational Biology)
URN: urn:nbn:se:uu:diva-206127
DOI: 10.3233/FI-2013-909
ISI: 000325745600021
OAI: oai:DiVA.org:uu-206127
DiVA: diva2:643671
Available from: 2013-08-28 Created: 2013-08-28 Last updated: 2013-11-14 Bibliographically approved
In thesis
1. Rule-Based Approaches for Large Biological Datasets Analysis: A Suite of Tools and Methods
2013 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis presents new and improved computational methods for analyzing complex biological data produced by advanced biotechnologies. Such data sets are not only very large but are also characterized by very high numbers of features. To address these needs, we developed a set of methods and tools suitable for analyzing large data sets, including next generation sequencing data, and built transparent models that can be interpreted by researchers who are not necessarily experts in computing. We focused on brain-related diseases.

The first aim of the thesis was to employ the meta-server approach to finding peaks in ChIP-seq data. Taking existing peak finders we created an algorithm that produces consensus results better than any single peak finder.

The second aim was to use supervised machine learning to identify features that are significant in the predictive diagnosis of Alzheimer disease in patients with mild cognitive impairment. This experience led to the development of a better feature selection method for rough sets, a machine learning method.

The third aim was to deepen the understanding of the role that the STAT3 transcription factor plays in gliomas. Interestingly, we found that STAT3, in addition to being an activator, is also a repressor in certain rat and human glioma models. This was achieved by analyzing STAT3 binding sites in combination with epigenetic marks. STAT3 regulation was determined using expression data from untreated cells and from cells after JAK2/STAT3 inhibition.

The four papers constituting the thesis are preceded by an exposition of the biological, biotechnological and computational background that provides foundations for the papers.

The overall results of this thesis bear witness to the mutually beneficial relationship that Bioinformatics maintains with modern Life Sciences and Computer Science.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2013. 40 p.
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1066
Rough sets, peak finding, gliomas, Alzheimer disease, STAT3, machine learning, feature selection, next generation sequencing
National Category
Cell and Molecular Biology; Bioinformatics and Systems Biology; Bioinformatics (Computational Biology)
urn:nbn:se:uu:diva-206137 (URN)
978-91-554-8733-1 (ISBN)
Public defence
2013-10-11, C8:301, Husargatan 3, Uppsala, 13:00 (English)
Available from: 2013-09-19 Created: 2013-08-28 Last updated: 2014-01-23

By author/editor
Baltzer, Nicholas; Komorowski, Jan