Uppsala University Publications
Supervised Learning Techniques: A comparison of the Random Forest and the Support Vector Machine
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Statistics.
2016 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

This thesis examines the performance of the support vector machine and the random forest in the context of binary classification. The two techniques are compared and the better-performing one is used to construct a final parsimonious model. The data set consists of 33 observations with 89 biomarkers as features and no known dependent variable. The dependent variable is therefore generated through k-means clustering, with a predefined final solution of two clusters. The algorithms are trained using five-fold cross-validation repeated twenty times. The training process reveals that the best-performing versions of the models are a linear-kernel support vector machine and a random forest with six randomly selected features at each split. On the test set, the optimally tuned random forest outperforms the linear-kernel support vector machine: the former classifies all observations correctly whilst the latter misclassifies one. Hence, a parsimonious random forest model using only the top five features is constructed, and it performs as well on the test set as the original random forest model using all features.
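The workflow summarised in the abstract can be illustrated with a short scikit-learn sketch. This is not the thesis's actual code: the biomarker data is not public, so a synthetic 33 x 89 matrix (with a small mean shift added so k-means finds two groups) stands in for it, and the test-set proportion, number of trees, and random seeds are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(33, 89))   # placeholder for the 33 x 89 biomarker matrix
X[16:] += 0.5                   # synthetic group structure so k-means finds two clusters

# Step 1: no dependent variable is known, so generate one by k-means
# clustering the standardised features into two clusters.
y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Step 2: hold out a test set, then compare the two models with
# five-fold cross-validation repeated twenty times on the training part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

svm = SVC(kernel="linear")                                   # linear-kernel SVM
rf = RandomForestClassifier(max_features=6,                  # six features per split
                            n_estimators=500, random_state=0)

for name, model in [("linear SVM", svm), ("random forest", rf)]:
    scores = cross_val_score(model, X_tr, y_tr, cv=cv)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")

# Step 3: evaluate the better model on the test set and build a parsimonious
# random forest from its five most important features.
rf.fit(X_tr, y_tr)
print("full random forest, test accuracy:", rf.score(X_te, y_te))
top5 = np.argsort(rf.feature_importances_)[-5:]
rf_small = RandomForestClassifier(n_estimators=500, random_state=0)
rf_small.fit(X_tr[:, top5], y_tr)
print("top-5 random forest, test accuracy:", rf_small.score(X_te, y_te))
```

With the real biomarker data in place of the synthetic matrix, the same three steps (label generation, repeated cross-validation, test-set comparison plus feature selection) correspond to the procedure described in the abstract.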

Place, publisher, year, edition, pages
2016. 57 p.
Keywords [en]
machine learning, biomarkers, cross-validation, receiver operating characteristic, k-means clustering, feature selection, binary classification
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:uu:diva-274768
OAI: oai:DiVA.org:uu-274768
DiVA: diva2:897594
External cooperation
Pharma Consulting Group
Subject / course
Statistics
Supervisors
Examiners
Available from: 2016-02-10. Created: 2016-01-26. Last updated: 2016-02-10. Bibliographically approved.

Open Access in DiVA

fulltext (1487 kB)
File name: FULLTEXT01.pdf
File size: 1487 kB
Type: fulltext
Mimetype: application/pdf
Checksum (SHA-512): 980d52fa75a2b91cd66d975d8df02fadbb6ec5ed190241582fd361d713c370d067acbfc0c453c728ff5df9e10e5c445e6407c0e0170ac1d246cd827f3ba0c0f3
