uu.seUppsala University Publications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
External cross-validation for unbiased evaluation of protein family detectors: application to allergens
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Medical Sciences, Clinical Pharmacology. (Cancer pharmacology and informatics (Rolf Larsson))
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Medical Sciences, Clinical Pharmacology. (Cancer pharmacology and informatics (Rolf Larsson))
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Medical Sciences, Clinical Pharmacology. (Cancer pharmacology and informatics (Rolf Larsson))
Show others and affiliations
2005 (English)In: Proteins, ISSN 0887-3585, Vol. 61, no 4, 918-925 p.Article in journal (Refereed) Published
Abstract [en]

Key issues in protein science and computational biology are design and evaluation of algorithms aimed at detection of proteins that belong to a specific family, as defined by structural, evolutionary, or functional criteria. In this context, several validation techniques are often used to compare different parameter settings of the detector, and to subsequently select the setting that yields the smallest error rate estimate. A frequently overlooked problem associated with this approach is that this smallest error rate estimate may have a large optimistic bias. Based on computer simulations, we show that a detector's error rate estimate can be overly optimistic and propose a method to obtain unbiased performance estimates of a detector design procedure. The method is founded on an external 10-fold cross-validation (CV) loop that embeds an internal validation procedure used for parameter selection in detector design. The designed detector generated in each of the 10 iterations are evaluated on held-out examples exclusively available in the external CV iterations. Notably, the average of these 10 performance estimates is not associated with a final detector, but rather with the average performance of the design procedure used. We apply the external CV loop to the particular problem of detecting potentially allergenic proteins, using a previously reported design procedure. Unbiased performance estimates of the allergen detector design procedure are presented together with information about which algorithms and parameter settings that are most frequently selected.

Place, publisher, year, edition, pages
2005. Vol. 61, no 4, 918-925 p.
Keyword [en]
protein classification, bias, bioinformatics, supervised learning, allergy, sequence alignment
National Category
Medical and Health Sciences
Identifiers
URN: urn:nbn:se:uu:diva-76006DOI: 10.1002/prot.20656PubMedID: 16231294OAI: oai:DiVA.org:uu-76006DiVA: diva2:103917
Available from: 2007-02-12 Created: 2007-02-12 Last updated: 2009-07-22Bibliographically approved
In thesis
1. Novel Computational Analyses of Allergens for Improved Allergenicity Risk Assessment and Characterization of IgE Reactivity Relationships
Open this publication in new window or tab >>Novel Computational Analyses of Allergens for Improved Allergenicity Risk Assessment and Characterization of IgE Reactivity Relationships
2008 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Immunoglobulin E (IgE) mediated allergy is a major and seemingly increasing health problem in the Western countries. The combined usage of databases of molecular and clinical information on allergens (allergenic proteins) as well as new experimental platforms capable of generating huge amounts of allergy-related data from a single blood test holds great potential to enhance our knowledge of this complex disease. To maximally benefit from this development, however, both novel and improved methods for computational analysis are urgently required. This thesis concerns two types of important and practical computational analyses of allergens: allergenicity/IgE-cross-reactivity risk assessment and characterization of IgE-reactivity patterns. Both directions rely on development and implementation of bioinformatics and statistical learning algorithms, which are applied to either amino acid sequence information of allergenic proteins or on quantified human blood serum levels of specific IgE-antibodies to allergen preparations (purified extracts of allergenic sources, such as e.g. peanut or birch).

The main application for computational risk assessment of allergenicity is to prevent unintentional introduction of allergen-encoding transgenes in genetically modified (GM) food crops. Two separate classification procedures for potential protein allergenicity are introduced. Both protocols rely on multivariate classification algorithms that are educated to discriminate allergens from presumable non-allergens based on their amino acid sequence. Both classification procedures are thoroughly evaluated and the second protocol shows state-of-the-art performance in comparison to current top-ranked methods. Moreover, several pitfalls in performance estimation of classifiers are demonstrated and procedures to circumvent these are suggested.

Visualization and characterization of IgE-reactivity patterns among allergen preparations are enabled by application of bioinformatics and statistical learning methods to a multivariate dataset holding recorded blood serum IgE-levels of over 1000 sensitized individuals, each measured to 89 allergen preparations. Moreover, a novel framework for divisive hierarchical clustering including graphical representation of the resulting output is introduced, which greatly simplifies analysis of the abovementioned dataset. Important IgE-reactivity relationships within several groups of allergen preparations are identified including well-known groups of clinically relevant cross-reactivities.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2008. 65 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine, ISSN 1651-6206 ; 385
Keyword
allergens, bioinformatics, statistical learning, performance estimation, risk assessment
National Category
Biomedical Laboratory Science/Technology
Identifiers
urn:nbn:se:uu:diva-9313 (URN)978-91-554-7308-2 (ISBN)
Public defence
2008-11-07, Lärosal IV, Universitetshuset, Uppsala, 13:00 (English)
Opponent
Supervisors
Available from: 2008-10-17 Created: 2008-10-17 Last updated: 2009-05-12Bibliographically approved

Open Access in DiVA

No full text

Other links

Publisher's full textPubMed

Authority records BETA

Isaksson, AndersGustafsson, Mats G.

Search in DiVA

By author/editor
Isaksson, AndersGustafsson, Mats G.
By organisation
Clinical Pharmacology
Medical and Health Sciences

Search outside of DiVA

GoogleGoogle Scholar

doi
pubmed
urn-nbn

Altmetric score

doi
pubmed
urn-nbn
Total: 567 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf