uu.se Uppsala University Publications
A graph-based approach for improving the homology inference in multiple sequence alignments
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Ecology and Genetics, Evolutionary Biology. (Whelan Lab)
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Ecology and Genetics, Evolutionary Biology. (Whelan Lab) ORCID iD: 0000-0003-3056-3173
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Ecology and Genetics, Evolutionary Biology.
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Multiple sequence alignment (MSA) is ubiquitous in evolutionary studies and other areas of bioinformatics. In nearly all cases MSAs are taken to be a known and fixed quantity on which to perform downstream analysis, despite extensive evidence that MSA accuracy and uncertainty affect results. Mistakes in the MSA are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error, either through proximity to gaps or through some scoring function. Although popular, this approach has had mixed success and several studies have even suggested that filtering might be detrimental to phylogenetic studies. Here we present a different approach to dealing with MSA accuracy and uncertainty through a graph-based approach implemented in the freely available software Divvier. The aim of Divvier is to identify clusters of characters that have strong statistical evidence of shared homology, based on the output of a pair hidden Markov model. These clusters can then be used either to filter characters out of the MSA, through a process we call partial filtering, or to represent each of the clusters in a new column, through a process we call divvying up. We validate our approach through its performance on real and simulated benchmarks, finding that Divvier substantially outperforms all other filtering software for treating MSAs by retaining more true positive homology calls and removing more false positive homology calls. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.
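
The clustering idea in the abstract can be illustrated with a small sketch. It treats the non-gap characters of one alignment column as graph nodes, links two characters when their pairwise posterior probability of homology reaches a threshold, and takes the connected components as clusters. This is not Divvier's actual algorithm: the function names, the 0.8 threshold, the connected-components rule, the choice to keep only the largest cluster during partial filtering, and the posterior values are all assumptions made for illustration.

```python
from itertools import combinations

GAP = "-"

def cluster_column(column, posterior, threshold=0.8):
    """Cluster the non-gap characters of one column (hypothetical helper).

    column:    string with one character per sequence, e.g. "AAGA"
    posterior: dict mapping (i, j), i < j, to an assumed pairwise
               posterior probability that characters i and j are homologous
    Returns a sorted list of clusters, each a list of sequence indices.
    """
    residues = [i for i, ch in enumerate(column) if ch != GAP]
    parent = {i: i for i in residues}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose support reaches the (assumed) threshold.
    for i, j in combinations(residues, 2):
        if posterior.get((i, j), 0.0) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in residues:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

def divvy_column(column, clusters):
    """Divvying sketch: emit one new column per cluster, gaps elsewhere."""
    return [
        "".join(ch if i in cluster else GAP for i, ch in enumerate(column))
        for cluster in clusters
    ]

def partial_filter_column(column, clusters):
    """Partial filtering sketch: keep the largest cluster, gap out the rest."""
    keep = max(clusters, key=len, default=[])
    return "".join(ch if i in keep else GAP for i, ch in enumerate(column))

# Made-up posteriors for a four-sequence column.
example_posterior = {
    (0, 1): 0.95, (0, 3): 0.90, (1, 3): 0.85,
    (0, 2): 0.10, (1, 2): 0.20, (2, 3): 0.05,
}
example_clusters = cluster_column("AAGA", example_posterior)
print(divvy_column("AAGA", example_clusters))
print(partial_filter_column("AAGA", example_clusters))
```

In this toy column the third character has weak support against all others, so divvying moves it to its own column while partial filtering replaces it with a gap; conventional column filtering would instead discard all four characters together.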

National Category
Evolutionary Biology
Identifiers
URN: urn:nbn:se:uu:diva-360839
OAI: oai:DiVA.org:uu-360839
DiVA, id: diva2:1249291
Available from: 2018-09-18 Created: 2018-09-18 Last updated: 2018-09-21
In thesis
1. Evolutionary Approaches to Sequence Alignment
2018 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Molecular evolutionary biology allows us to look into the past by analyzing sequences of amino acids or nucleotides. These analyses can be very complex, often involving advanced statistical models of sequence evolution to construct phylogenetic trees, study the patterns of natural selection and perform a number of other evolutionary studies. In many cases, these studies require multiple sequence alignment (MSA) as a prerequisite: a technique that aims to group characters that share a common ancestor, that is, homologous characters, into columns. Statistical models need this information about shared homology to describe the process of substitution and perform evolutionary inference. Sequence alignment, however, is difficult, and MSAs often contain whole regions of wrongly aligned characters, which affect downstream analyses.

In this thesis I use two broad groups of approaches to avoid errors in the alignment. The first group comprises analysis methods that work without a sequence alignment by explicitly modelling the processes of substitution, insertion and deletion (indels) between pairs of sequences using pair hidden Markov models. I describe an accurate tree inference method that uses a neighbor joining clustering approach to construct a tree from a matrix of model-based evolutionary distances.
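
The neighbor joining step mentioned above can be sketched as follows. This is the standard textbook neighbor joining procedure applied to a toy distance matrix, not the thesis's implementation; the function name, the data layout (a dict of dicts) and the distances are assumptions, and in the thesis the distances would come from the pair hidden Markov model rather than being typed in by hand.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Standard neighbor joining on a symmetric distance matrix.

    names: list of taxon labels
    dist:  dict of dicts with dist[a][b] == dist[b][a] for all a != b
    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    d = {a: dict(dist[a]) for a in names}   # working copy
    nodes = list(names)
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each active node to all other active nodes.
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}

        # Join the pair that minimises the Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )

        new = f"node{next_id}"
        next_id += 1

        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))

        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])

        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Connect the last two nodes.
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

# Toy additive distances (made up) for four taxa.
toy = {
    "A": {"B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9},
}
for edge in neighbor_joining(["A", "B", "C", "D"], toy):
    print(edge)
```

Because the toy distances are additive, the sketch groups A with B and C with D and recovers the branch lengths used to build the matrix. With distances estimated from real sequences the result is only an approximation of the underlying tree.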

Next, I develop a pairwise method of modelling how natural selection acts on substitutions and indels. I further examine the relationship between the constraints acting on these two evolutionary forces and show that natural selection affects them in a similar way.

The second group of approaches deals with errors in existing alignments. I use a statistical model-based approach to evaluate the quality of multiple sequence alignments.

First, I provide a graph-based tool for removing wrongly aligned pairs of residues by splitting them apart. This approach tends to produce better results when compared to standard column-based filtering.

Second, I provide a way to compare MSAs using a probabilistic framework. I propose new ways of scoring sequence alignments and show that popular methods produce similar results.

The overall purpose of this work is to facilitate more accurate evolutionary analyses by addressing the problem of sequence alignment in a statistically rigorous manner.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 57
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1723
Keywords
molecular evolution, multiple sequence alignment, pair hidden Markov models
National Category
Evolutionary Biology
Research subject
Biology with specialization in Evolutionary Genetics
Identifiers
urn:nbn:se:uu:diva-360871 (URN)
978-91-513-0445-8 (ISBN)
Public defence
2018-11-09, Ekmansalen, EBC, Norrbyvägen 14, Uppsala, 09:00 (English)
Opponent
Supervisors
Available from: 2018-10-17 Created: 2018-09-19 Last updated: 2018-11-19

Open Access in DiVA

No full text in DiVA

By author/editor
Bogusz, Marcin; Whelan, Simon