uu.seUppsala University Publications
Towards interoperable and reproducible QSAR analyses: Exchange of data sets
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences. ORCID iD: 0000-0002-8083-2864
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences.
NIH Chemical Genomics Center.
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences.
2010 (English). In: Journal of Cheminformatics, ISSN 1758-2946, Vol. 2, 5. Article in journal (Refereed), Published
Abstract [en]

BACKGROUND: QSAR/QSPR is a widely used method to relate chemical structures and responses based on experimental observations. In QSAR, chemical structures are expressed as descriptors, which are mathematical representations like calculated properties or enumerated fragments. Many existing QSAR data sets are based on a combination of different software tools mixed with in-house developed solutions, with datasets manually assembled in spreadsheets. Currently there exists no agreed-upon definition of descriptors and no standard for exchanging data sets in QSAR, which together with numerous different descriptor implementations makes it a virtually impossible task to reproduce and validate analyses, and significantly hinders collaborations and re-use of data.

RESULTS: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR/QSPR data sets, comprising an open XML format (QSAR-ML) and an open extensible descriptor ontology (Blue Obelisk Descriptor Ontology). The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a data set described by QSAR-ML makes its setup completely reproducible. We also provide an implementation as a set of plugins for Bioclipse that simplifies QSAR data set formation, and allows for exporting in QSAR-ML as well as traditional CSV formats. The implementation facilitates addition of new descriptor implementations, from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services.
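
To make the idea of versioned descriptor implementations more concrete, the following is a minimal Python sketch of a data set that records, for each descriptor, an identifier plus the software and version that computed it, and that can be exported to CSV. It is not the QSAR-ML schema and not the Bioclipse API: the class names, field names, the `ex:` identifier, and the provider and version strings are all hypothetical and chosen only for illustration.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DescriptorRef:
    """A descriptor pinned to one implementation (all values are hypothetical)."""
    ontology_id: str  # unique identifier of the descriptor
    provider: str     # software that computes the descriptor
    version: str      # version of that software

    def column_name(self) -> str:
        # Encode identifier, provider, and version so a column is unambiguous.
        return f"{self.ontology_id}|{self.provider}|{self.version}"


@dataclass
class QsarDataSet:
    descriptors: List[DescriptorRef] = field(default_factory=list)
    # structure identifier -> {descriptor -> computed value}
    values: Dict[str, Dict[DescriptorRef, float]] = field(default_factory=dict)
    # structure identifier -> measured response
    responses: Dict[str, float] = field(default_factory=dict)

    def add(self, structure: str, response: float,
            computed: Dict[DescriptorRef, float]) -> None:
        """Store one structure; every declared descriptor must have a value."""
        missing = [d for d in self.descriptors if d not in computed]
        if missing:
            raise ValueError(f"missing descriptor values for {structure}")
        self.values[structure] = dict(computed)
        self.responses[structure] = response

    def to_csv(self) -> str:
        """Export as CSV: one row per structure, one column per descriptor."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        header = ["structure"]
        header += [d.column_name() for d in self.descriptors]
        writer.writerow(header + ["response"])
        for structure, row in self.values.items():
            cells = [row[d] for d in self.descriptors]
            writer.writerow([structure] + cells + [self.responses[structure]])
        return buf.getvalue()


# Placeholder descriptor, structure, and numbers, not real measurements.
xlogp = DescriptorRef("ex:xlogp", "toolkit-a", "1.2.0")
dataset = QsarDataSet(descriptors=[xlogp])
dataset.add("CCO", response=0.5, computed={xlogp: -0.14})
print(dataset.to_csv())
```

Keeping provider and version next to the identifier is the design choice that matters here: two columns with the same descriptor identifier but different versions remain distinguishable after export.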

CONCLUSIONS: Standardized QSAR data sets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible dataset formation, solving the problems of defining which software components were used, their versions, and the case of multiple names for the same descriptor. This makes it easy to join, extend, and combine data sets, and also to work collectively. The presented Bioclipse plugins equip scientists with intuitive tools that make QSAR-ML widely available for the community.
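
The problem of multiple names for the same descriptor, and why shared identifiers make data sets easier to combine, can be illustrated with a short Python sketch. Two tables label the same descriptor differently, and a mapping from local names to a shared identifier allows them to be merged. The mapping, the `ex:` identifiers, the local column names, and the numeric values are invented for this example and are not taken from the Blue Obelisk Descriptor Ontology or from any real data set.

```python
from typing import Dict

# Hypothetical mapping from tool-specific column names to shared identifiers.
NAME_TO_ID = {
    "XLogP": "ex:xlogp",
    "xlogp_value": "ex:xlogp",
    "MW": "ex:molecularWeight",
    "mol_weight": "ex:molecularWeight",
}

# structure identifier -> {column name -> value}
Table = Dict[str, Dict[str, float]]


def normalise(table: Table) -> Table:
    """Rename every column to its shared identifier (KeyError if unknown)."""
    return {
        structure: {NAME_TO_ID[name]: value for name, value in row.items()}
        for structure, row in table.items()
    }


def combine(a: Table, b: Table) -> Table:
    """Merge two tables after normalisation, refusing conflicting values."""
    merged: Table = {s: dict(row) for s, row in normalise(a).items()}
    for structure, row in normalise(b).items():
        target = merged.setdefault(structure, {})
        for descriptor, value in row.items():
            if descriptor in target and target[descriptor] != value:
                raise ValueError(f"conflict for {structure} / {descriptor}")
            target[descriptor] = value
    return merged


# Placeholder tables from two hypothetical sources.
lab_a = {"CCO": {"XLogP": -0.14}}
lab_b = {"CCO": {"mol_weight": 46.07}, "CCN": {"xlogp_value": -0.13}}
combined = combine(lab_a, lab_b)
print(combined)
```

This sketch only resolves names. A faithful merge would presumably also check that both sources used the same descriptor implementation and version, since the abstract identifies differing implementations as a source of irreproducibility.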

Place, publisher, year, edition, pages
BioMed Central, 2010. Vol. 2, 5
Keyword [en]
QSAR, Bioclipse, standard, ontology, life sciences, bioinformatics, cheminformatics, reproducible
National Category
Bioinformatics and Systems Biology
Research subject
Bioinformatics
Identifiers
URN: urn:nbn:se:uu:diva-109302
DOI: 10.1186/1758-2946-2-5
ISI: 000208222200004
PubMedID: 20591161
OAI: oai:DiVA.org:uu-109302
DiVA: diva2:271865
Available from: 2009-10-13. Created: 2009-10-13. Last updated: 2015-08-14. Bibliographically approved.
In thesis
1. Bioclipse: Integration of Data and Software in the Life Sciences
2009 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

New high throughput experimental techniques have turned the life sciences into a data-intensive field. Scientists are faced with new types of problems, such as managing voluminous sources of information, integrating heterogeneous data, and applying the proper analysis algorithms; all to end up with reliable conclusions. These challenges call for an infrastructure of algorithms and technologies to supply researchers with the tools and methods necessary to maximize the usefulness of the data. eScience has emerged as a promising technology to take on these challenges, and denotes integrated science carried out in highly distributed network environments, or science that makes use of large data sets and requires high performance computing resources.

In this thesis I present standards, exchange formats, algorithms, and software implementations for empowering researchers in the life sciences with the tools of eScience. The work is centered around Bioclipse - an extensible workbench developed in the frame of this thesis - which provides users with instruments for carrying out integrated research and where technical details are hidden under simple graphical interfaces. Bioclipse is a Rich Client that takes full advantage of the many offerings of eScience, such as networked databases and online services. The benefits of mixing local and remote software in a unifying platform are demonstrated with an integrated approach for predicting metabolic sites in chemical structures. To overcome the limitations of the commonly used technologies for interacting with networked services, I also present a new technology using the XMPP protocol. This enables service discovery and asynchronous communication between the client and server, which is ideal for long-running analyses.

To maximize the usefulness of the available data there is a need for standards, ontologies, and exchange formats, in order to define what information should be captured and how it should be structured and exchanged. A novel format for exchanging QSAR data sets in a fully interoperable and reproducible form is presented, together with an implementation in Bioclipse that takes advantage of eScience components during the setup process.

Bioclipse has been well received by the scientific community, attracted a large group of international users and developers, and has been awarded three international prizes for its innovative character. With continued development, the project has a good chance of becoming an important component in a sustainable infrastructure for the life sciences.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2009. 53 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Pharmacy, ISSN 1651-6192 ; 111
Keyword
Bioclipse, integration, life sciences, bioinformatics, cheminformatics, chemoinformatics, eclipse, rich client, xmpp, qsar-ml, web service, standard, ontology
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:uu:diva-109305 (URN)
978-91-554-7633-5 (ISBN)
Public defence
2009-11-27, B42, Uppsala Biomedical Center (BMC), Husargatan 3, Uppsala, 13:15 (English)
Available from: 2009-11-06. Created: 2009-10-13. Last updated: 2015-05-04. Bibliographically approved.

Open Access in DiVA

fulltext (1938 kB), 91 downloads
File information
File name: FULLTEXT01.pdf
File size: 1938 kB
Checksum (SHA-512): 43750bc71accd60a75147d50504f29bbf9c2357841ce21f74995957bb8529773153275983fe829f3b2ddca4c7885f6659b0457108d6ed652e0b037cbf974103f
Type: fulltext
Mimetype: application/pdf
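
A downloaded copy of the full text can be checked against the SHA-512 checksum listed above. The following is a minimal Python sketch using only the standard library. The function names are illustrative, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex checksum."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("FULLTEXT01.pdf", "<checksum from the record>")` should return `True` if the local file is byte-identical to the one the record describes.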

Authority records

Spjuth, Ola; Willighagen, Egon