Uppsala University Publications (uu.se)
Reproducible Data Analysis in Drug Discovery with Scientific Workflows and the Semantic Web
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Pharmacy, Department of Pharmaceutical Biosciences; Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Stockholm, Sweden. (Pharmaceutical Bioinformatics)
ORCID iD: 0000-0001-6740-9212
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The pharmaceutical industry is facing a research and development productivity crisis. At the same time, we have access to more biological data than ever, thanks to recent advances in high-throughput experimental methods. One suggested explanation for this apparent paradox is that a reproducibility crisis has also affected the reliability of the datasets that form the basis for drug development. Advanced computing infrastructures can to some extent help in this situation, but they come with their own challenges, including increased technical debt and opaqueness from the many layers of technology required to perform computations and manage data. In this thesis, a number of approaches and methods for handling data and computations in early drug discovery in a reproducible way are developed. This has been done while striving for a high level of simplicity in their implementations, to make the research done with them easier to understand. Based on problems identified in existing tools, two workflow tools have been developed with the aim of making the writing of complex workflows, particularly in predictive modelling, more agile and flexible. One of the tools is based on the Luigi workflow framework, while the other is written from scratch in the Go language. We have applied these tools to predictive modelling problems in early drug discovery to create reproducible workflows for building predictive models, including models for predicting off-target binding. We have also developed a set of practical tools for working with linked data in a collaborative way, and for publishing large-scale datasets in a semantic, machine-readable format on the web. These tools were applied to demonstrator use cases and used for publishing large-scale chemical data. It is our hope that the developed tools and approaches will contribute towards practical, reproducible and understandable handling of data and computations in early drug discovery.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 68
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Pharmacy, ISSN 1651-6192; 256
Keywords [en]
Reproducibility, Scientific Workflow Management Systems, Workflows, Pipelines, Flow-based programming, Predictive modelling, Semantic Web, Linked Data, Semantic MediaWiki, MediaWiki, RDF, SPARQL, Golang
Keywords [sv]
Reproducerbarhet, Arbetsflödeshanteringssystem, Flödesbaserad programmering, Prediktiv modellering, Semantiska webben, Länkade data, Go
National Category
Pharmacology and Toxicology; Bioinformatics (Computational Biology)
Research subject
Bioinformatics; Pharmacology
Identifiers
URN: urn:nbn:se:uu:diva-358353
ISBN: 978-91-513-0427-4 (print)
OAI: oai:DiVA.org:uu-358353
DiVA, id: diva2:1242336
Public defence
2018-09-28, Room B22, Biomedicinskt Centrum, Husargatan 3, Uppsala, 13:00 (English)
Funder
EU, Horizon 2020, 654241; Swedish e‐Science Research Center; eSSENCE - An eScience Collaboration
Available from: 2018-09-04 Created: 2018-08-28 Last updated: 2018-09-10
List of papers
1. Large-scale ligand-based predictive modelling using support vector machines
2016 (English). In: Journal of Cheminformatics, ISSN 1758-2946, E-ISSN 1758-2946, Vol. 8, article id 39. Article in journal (Refereed). Published
Abstract [en]

The increasing size of datasets in drug discovery makes it challenging to build robust and accurate predictive models within a reasonable amount of time. In order to investigate the effect of dataset sizes on predictive performance and modelling time, ligand-based regression models were trained on open datasets of varying sizes of up to 1.2 million chemical structures. For modelling, two implementations of support vector machines (SVM) were used. Chemical structures were described by the signatures molecular descriptor. Results showed that for the larger datasets, the LIBLINEAR SVM implementation performed on par with the well-established libsvm with a radial basis function kernel, but with dramatically less time for model building even on modest computer resources. Using a non-linear kernel proved to be infeasible for large data sizes, even with substantial computational resources on a computer cluster. To deploy the resulting models, we extended the Bioclipse decision support framework to support models from LIBLINEAR and made our models of logD and solubility available from within Bioclipse.

Keywords
Predictive modelling; Support vector machine; Bioclipse; Molecular signatures; QSAR
National Category
Pharmaceutical Sciences; Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:uu:diva-248959 (URN); 10.1186/s13321-016-0151-5 (DOI); 000381186100001 (); 27516811 (PubMedID)
Funder
Swedish National Infrastructure for Computing (SNIC), b2013262 b2015001; Science for Life Laboratory - a national resource center for high-throughput molecular bioscience; eSSENCE - An eScience Collaboration
Available from: 2015-04-09 Created: 2015-04-09 Last updated: 2018-08-28. Bibliographically approved
2. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles
2016 (English). In: Journal of Cheminformatics, ISSN 1758-2946, E-ISSN 1758-2946, Vol. 8, article id 67. Article in journal (Refereed). Published
Abstract [en]

Predictive modelling in drug discovery is challenging to automate as it often contains multiple analysis steps and might involve cross-validation and parameter tuning that create complex dependencies between tasks. With large-scale data or when using computationally demanding modelling methods, e-infrastructures such as high-performance or cloud computing are required, adding to the existing challenges of fault-tolerant automation. Workflow management systems can aid in many of these challenges, but the currently available systems are lacking in the functionality needed to enable agile and flexible predictive modelling. We here present an approach inspired by elements of the flow-based programming paradigm, implemented as an extension of the Luigi system which we name SciLuigi. We also discuss the experiences from using the approach when modelling a large set of biochemical interactions using a shared computer cluster.
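
The flow-based idea mentioned in the abstract can be illustrated with a minimal sketch: tasks expose named in-ports, and dependencies are declared by wiring ports together in the workflow definition rather than inside each task. The sketch below is plain Python and does not use the SciLuigi or Luigi API; the `Task` class, the port names and the toy "train" and "evaluate" steps are all hypothetical.

```python
# Illustrative only: plain Python, NOT the SciLuigi or Luigi API.
# All class, port, and task names here are hypothetical.

class Task:
    """A workflow step with named in-ports that are wired up externally."""

    def __init__(self, name, func, in_ports=()):
        self.name = name
        self.func = func
        # Each in-port holds a reference to the upstream task feeding it.
        self.inputs = {port: None for port in in_ports}
        self._result = None
        self._done = False

    def connect(self, port, upstream):
        """Wire an upstream task's output to one of this task's in-ports."""
        if port not in self.inputs:
            raise KeyError(f"{self.name} has no in-port named {port!r}")
        self.inputs[port] = upstream

    def run(self):
        """Run upstream tasks first, then this one; each task runs once."""
        if not self._done:
            args = {}
            for port, up in self.inputs.items():
                if up is None:
                    raise ValueError(f"in-port {port!r} of {self.name} is unconnected")
                args[port] = up.run()
            self._result = self.func(**args)
            self._done = True
        return self._result


def build_workflow(cost):
    """Wire a toy train/evaluate workflow for one value of a tuning parameter."""
    load = Task("load", lambda: [1.0, 2.0, 3.0, 4.0])
    train = Task("train", lambda data: cost * sum(data) / len(data), in_ports=("data",))
    evaluate = Task(
        "evaluate",
        lambda data, model: max(abs(x - model) for x in data),
        in_ports=("data", "model"),
    )
    # Dependencies live in the wiring, not inside the task definitions.
    train.connect("data", load)
    evaluate.connect("data", load)
    evaluate.connect("model", train)
    return evaluate


# A parameter sweep is a loop that builds one wired workflow per value.
sweep_results = {cost: build_workflow(cost).run() for cost in (0.5, 1.0, 2.0)}
# sweep_results == {0.5: 2.75, 1.0: 1.5, 2.0: 4.0}
```

Because the wiring is separate from the task definitions, the same `train` and `evaluate` steps could be reconnected for cross-validation or a different sweep without editing them, which is the kind of flexibility the abstract argues for.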

Keywords
Predictive modelling, Machine learning, Workflows, Drug discovery, Flow-based programming
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-315089 (URN); 10.1186/s13321-016-0179-6 (DOI); 000391703900001 ()
Funder
eSSENCE - An eScience Collaboration; Swedish e‐Science Research Center; Swedish National Infrastructure for Computing (SNIC), b2013262
Available from: 2017-02-09 Created: 2017-02-09 Last updated: 2018-08-28. Bibliographically approved
3. RDFIO: extending Semantic MediaWiki for interoperable biomedical data management
2017 (English). In: Journal of Biomedical Semantics, ISSN 2041-1480, E-ISSN 2041-1480, Vol. 8, article id 35. Article in journal (Refereed). Published
Abstract [en]

BACKGROUND: The biological sciences are characterised not only by an increasing amount of data but also by its extreme complexity. This stresses the need for efficient ways of integrating these data into a coherent description of biological systems. In many cases, biological data need organization before integration. This is often a collaborative effort, and it is thus important that tools for data integration support a collaborative way of working. Wiki systems with support for structured semantic data authoring, such as Semantic MediaWiki, provide a powerful solution for collaborative editing of data combined with machine-readability, so that data can be handled in an automated fashion in any downstream analyses. However, Semantic MediaWiki lacks a built-in data import function, which hinders efficient round-tripping of data between interoperable Semantic Web formats such as RDF and the internal wiki format.

RESULTS: To address this deficiency, the RDFIO suite of tools is presented. It supports importing RDF data into Semantic MediaWiki together with the metadata needed to export it again in the same RDF format, or ontology. Additionally, the new functionality enables mash-ups of automated data imports combined with manually created data presentations. The application of the suite of tools is demonstrated by importing drug discovery related data about rare diseases from Orphanet and acid dissociation constants from Wikidata. The RDFIO suite of tools is freely available for download via pharmb.io/project/rdfio.

CONCLUSIONS: A set of biomedical demonstrators shows how the new functionality enables a number of usage scenarios where the interoperability of SMW and the wider Semantic Web is leveraged for biomedical data sets, to create an easy-to-use and flexible platform for exploring and working with biomedical data.

Keywords
MediaWiki, RDF, SPARQL, Semantic MediaWiki, Semantic Web, Wiki, Wikidata
National Category
Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:uu:diva-329195 (URN); 10.1186/s13326-017-0136-y (DOI); 000409081000001 (); 28870259 (PubMedID)
Funder
eSSENCE - An eScience Collaboration; Swedish e‐Science Research Center; EU, FP7, Seventh Framework Programme
Available from: 2017-09-10 Created: 2017-09-10 Last updated: 2018-08-28. Bibliographically approved
4. A confidence predictor for logD using conformal regression and a support-vector machine
2018 (English). In: Journal of Cheminformatics, ISSN 1758-2946, E-ISSN 1758-2946, Vol. 10, no. 1, article id 17. Article in journal (Refereed). Published
Abstract [en]

Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created and evaluated using a support-vector machine with a linear kernel and conformal prediction methodology, outputting prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], with the best performing nonconformity measure having a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predictive values at 90% confidence level for 91 M PubChem structures in RDF format for download and via a URI resolver service.
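
To make "prediction intervals at a specified confidence level" concrete, here is a minimal sketch of inductive (split) conformal regression. It rests on several assumptions not taken from the paper: the underlying model is an ordinary least-squares line standing in for the linear SVM, the nonconformity measure is the plain absolute residual (the paper compares several measures), and the data are synthetic.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a stand-in for the SVM)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def conformal_half_width(cal_residuals, confidence):
    """Interval half-width taken from the calibration residuals."""
    n = len(cal_residuals)
    # Rank of the residual to use, with the finite-sample (n + 1) correction.
    k = math.ceil((n + 1) * confidence)
    if k > n:
        return math.inf  # too few calibration points for this confidence
    return sorted(cal_residuals)[k - 1]


# Synthetic data: a noisy linear relationship (purely illustrative).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(600)]
ys = [0.7 * x - 1.0 + rng.gauss(0, 0.5) for x in xs]

# Split into proper training set, calibration set, and test set.
train, cal, test = slice(0, 300), slice(300, 500), slice(500, 600)
a, b = fit_line(xs[train], ys[train])

# Nonconformity score: absolute residual on the calibration set.
residuals = [abs(y - (a * x + b)) for x, y in zip(xs[cal], ys[cal])]
half_80 = conformal_half_width(residuals, 0.80)
half_90 = conformal_half_width(residuals, 0.90)


def predict_interval(x, half_width):
    """Point prediction widened symmetrically by the calibrated half-width."""
    centre = a * x + b
    return centre - half_width, centre + half_width


# Empirical coverage of the 90% intervals on held-out data.
covered = 0
for x, y in zip(xs[test], ys[test]):
    lo, hi = predict_interval(x, half_90)
    covered += lo <= y <= hi
coverage_90 = covered / len(xs[test])
```

With this measure every compound gets an interval of the same width; measures that normalise the residual by an estimate of prediction difficulty produce compound-specific widths, which is why the choice of nonconformity measure affects the median interval size.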

Keywords
Conformal prediction, LogD, Machine learning, QSAR, RDF, Support-vector machine
National Category
Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:uu:diva-347779 (URN); 10.1186/s13321-018-0271-1 (DOI); 000429065900001 (); 29616425 (PubMedID)
Funder
EU, Horizon 2020, 731075
Available from: 2018-04-06 Created: 2018-04-06 Last updated: 2018-08-28. Bibliographically approved
5. Predicting off-target binding profiles with confidence using Conformal Prediction
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Ligand-based models can be used in drug discovery to obtain an early indication of potential off-target interactions that could be linked to adverse effects. Another application is to combine such models into a panel, making it possible to compare and search for compounds with similar profiles. Most contemporary methods and implementations, however, lack valid measures of confidence in their predictions and provide only point predictions. We here describe the use of conformal prediction for predicting off-target interactions with models trained on data from 31 targets in the ExCAPE dataset, selected for their utility in broad early hazard assessment. Chemicals were represented by the signature molecular descriptor and support vector machines were used as the underlying machine learning method. By using conformal prediction, the results from predictions come in the form of confidence p-values for each class. The full pre-processing and model training process is openly available as scientific workflows on GitHub, rendering it fully reproducible. We illustrate the usefulness of the methodology on a set of compounds extracted from DrugBank. The resulting models are published online and are available via a graphical web interface and an OpenAPI interface for programmatic access.
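
The "confidence p-values for each class" can be sketched as follows. This is a minimal illustration of inductive conformal classification with class-conditional (Mondrian) calibration, which is an assumption here since the abstract does not say which variant was used. The nonconformity scores are invented numbers; in practice they would be derived from the underlying classifier.

```python
def p_value(cal_scores, test_score):
    """Share of calibration examples at least as nonconforming as the test one."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def predict_p_values(cal_scores_by_class, test_scores_by_class):
    """One p-value per class, each calibrated only against that class."""
    return {
        label: p_value(cal_scores_by_class[label], test_scores_by_class[label])
        for label in cal_scores_by_class
    }


def prediction_set(p_values, confidence):
    """Labels that cannot be rejected at the given confidence level."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical nonconformity scores (higher = stranger), e.g. derived from
# the distance to a classifier's decision boundary.
calibration = {
    "active": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "inactive": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45],
}
# Scores a new compound would receive under each tentative label.
new_compound = {"active": 0.95, "inactive": 0.12}

p_values = predict_p_values(calibration, new_compound)
# p_values == {"active": 0.1, "inactive": 0.8}

labels_80 = prediction_set(p_values, confidence=0.80)  # {"inactive"}
labels_95 = prediction_set(p_values, confidence=0.95)  # both labels
```

Raising the confidence level can only enlarge the prediction set: at 80% the compound is predicted "inactive", while at 95% neither label can be rejected. A panel of such per-target p-values is what would make up a compound's off-target profile.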

Keywords
target profiles, predictive modelling, conformal prediction, machine learning, off-target, adverse effects
National Category
Pharmacology and Toxicology
Research subject
Pharmacology
Identifiers
urn:nbn:se:uu:diva-357894 (URN)
Funder
EU, Horizon 2020, 731075
Available from: 2018-08-21 Created: 2018-08-21 Last updated: 2018-09-04. Bibliographically approved
6. SciPipe - A workflow library for agile development of complex and dynamic bioinformatics pipelines
2018 (English). Article in journal (Other academic). Epub ahead of print
Abstract [en]

Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation and aid reproducibility of analyses. Many contemporary workflow tools are specialized and not designed for highly complex workflows, such as those with nested loops, dynamic scheduling and parametrization, which are common in e.g. machine learning.

Findings: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, re-usable components. It supports running subsets of workflows for improved iterative development, and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline.

Conclusions: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible programming API suitable for scientists used to programming or scripting.
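
The idea of a data-centric audit trace can be sketched roughly as below: every output file gets a sidecar record describing the step and parameters that produced it, with the records of its inputs embedded, so the full trace travels with the data. This is a plain Python illustration, not SciPipe's Go API or its actual audit file format; the `run_step` helper and the `.audit.json` naming are hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path


def audit_path(data_path):
    """Hypothetical naming convention: <output file>.audit.json."""
    return data_path.with_name(data_path.name + ".audit.json")


def run_step(name, func, out_path, params=None, inputs=()):
    """Run func to produce out_path and write an audit record next to it."""
    started = time.time()
    func(out_path, *inputs, **(params or {}))
    audit = {
        "step": name,
        "params": params or {},
        "output": out_path.name,
        "seconds": round(time.time() - started, 3),
        # Embed upstream audit records so the trace is complete on its own.
        "upstream": [json.loads(audit_path(p).read_text()) for p in inputs],
    }
    audit_path(out_path).write_text(json.dumps(audit, indent=2))
    return out_path


def write_numbers(out, count):
    """Toy step: write the integers 0..count-1, one per line."""
    out.write_text("\n".join(str(i) for i in range(count)))


def sum_numbers(out, src):
    """Toy step: sum the integers found in src."""
    out.write_text(str(sum(int(line) for line in src.read_text().splitlines())))


workdir = Path(tempfile.mkdtemp())
numbers = run_step("write_numbers", write_numbers, workdir / "numbers.txt", {"count": 5})
total = run_step("sum_numbers", sum_numbers, workdir / "total.txt", inputs=(numbers,))

# The trace for the final output includes how its input was produced.
trace = json.loads(audit_path(total).read_text())
```

Keeping the trace in a structured format next to each output is what would make later conversion to other report formats straightforward, since a converter only needs to read the one file that accompanies the result.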

Keywords
Scientific Workflow Management Systems, Workflow tools, Workflows, Pipelines, Reproducibility, Machine Learning, Flow-based Programming, Go, Golang
National Category
Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:uu:diva-358347 (URN); 10.1101/380808 (DOI)
Funder
eSSENCE - An eScience Collaboration; Swedish e‐Science Research Center; EU, Horizon 2020, 654241
Available from: 2018-08-27 Created: 2018-08-27 Last updated: 2018-08-28. Bibliographically approved

Open Access in DiVA

fulltext (1480 kB), 188 downloads
File information
File name: FULLTEXT01.pdf
File size: 1480 kB
Checksum (SHA-512):
073a9d410832d2d6db68349c01294b647c35b7becd6741ef4811fde2f8b543307ee0d99ecce703de5712035c4dd8d82321f7f3c0515ebd2b1f5f8543d30c4a86
Type: fulltext
Mimetype: application/pdf

Authority records BETA

Lampa, Samuel
