Uppsala University Publications (uu.se)
Software Tools and Algorithms for Shotgun Sequence Assembly
Uppsala University, Medicinska vetenskapsområdet, Faculty of Medicine, Department of Genetics and Pathology.
2002 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

During the last ten years, a genomics revolution has changed the ways biological research is carried out. The draft sequence of the human genome is available, as well as the sequence of 84 other completed genomes. High-throughput genomics technologies such as genome sequencing with associated bioinformatics tools have been instrumental in this process. The draft genome sequences were determined using the shotgun sequencing strategy, where long DNA molecules are randomly sheared into small pieces from which sequences are determined. These are assembled by computer programs in order to reconstruct the original genome sequence. Ubiquitous repeated sequences together with errors in the sequencing process complicate the assembly of shotgun fragments. In most genome projects gaps are caused by this complication.

This thesis presents methods and algorithms to separate repeated sequences in shotgun projects. The Tandem Repeat Assembly Program (TRAP) builds multiple alignments of reads, which are then analyzed in order to discriminate sequencing errors from real differences between highly similar repeats. The method is based on the fact that sequencing errors are randomly distributed, as opposed to the systematic distribution of mutations in repeat copies. The TRAP assembler was shown to be able to correctly assemble 2000 bp repeat copies that are repeated in tandem eight times. The degree of difference between repeat copies was 1.0%, and the maximum sequencing error 11%.
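As a rough illustration of the idea in the paragraph above, the sketch below scans the columns of a toy multiple alignment and flags those where a minority base is supported by several reads. Random sequencing errors rarely repeat in the same column, whereas a true difference between repeat copies shows up in every read from that copy. This is not TRAP's implementation, which this record does not describe; the function name `candidate_columns`, the `min_support` threshold, and the toy reads are all assumptions made for the example.

```python
from collections import Counter

def candidate_columns(alignment, min_support=2):
    """Return (column, base, count) for each column whose second most
    common symbol occurs in at least `min_support` reads.

    alignment: list of equal-length strings (aligned reads, '-' for gaps).
    The threshold is a hypothetical stand-in for a real statistical test.
    """
    hits = []
    for col in range(len(alignment[0])):
        counts = Counter(read[col] for read in alignment)
        ranked = counts.most_common()
        if len(ranked) < 2:
            continue  # every read agrees in this column
        base, count = ranked[1]
        if count >= min_support:
            hits.append((col, base, count))
    return hits

# Toy alignment (invented data): column 3 differs consistently in three
# reads, column 6 differs in a single read only.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGAACGTAC",
    "ACGTACCTAC",
]

flagged = candidate_columns(reads)
```

With these toy reads only column 3 is flagged; the isolated mismatch in column 6 is treated as a likely sequencing error. A fixed support threshold is the simplest possible criterion and would not cope with the error rates quoted above without a proper probabilistic model.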

A refined method based on single base differences between repeat copies has been developed to further improve repeat separation. Results show that in the same sequence, 87% of all the single base differences present in the repeats can be detected, with an error of only 1.6%.

In addition, a novel pattern-matching algorithm was developed. This algorithm takes advantage of the inherent symmetry between indices that can be computed for similar words of the same length and was implemented in the error correction software, MisEd. The results show that up to 99.3% of the sequencing errors can be corrected, while up to 87% of the single base differences remain.
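The record does not spell out the pattern-matching algorithm, so the following is only one plausible reading of "symmetry between indices", offered as a sketch. If each word of length k is encoded as a base-4 integer, every word that differs from it in exactly one position has an index that differs by a multiple of a power of 4, so similar words can be located arithmetically rather than by string comparison. The encoding and the names `word_index` and `neighbour_indices` are assumptions and are not taken from MisEd.

```python
BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}

def word_index(word):
    """Encode a DNA word as a base-4 integer (A=0, C=1, G=2, T=3)."""
    idx = 0
    for b in word:
        idx = idx * 4 + CODE[b]
    return idx

def neighbour_indices(word):
    """Indices of all words at Hamming distance 1 from `word`,
    computed from the word's own index without building the strings."""
    k = len(word)
    base_idx = word_index(word)
    out = []
    for pos, b in enumerate(word):
        weight = 4 ** (k - 1 - pos)  # place value of this position
        for other in range(4):
            if other != CODE[b]:
                out.append(base_idx + (other - CODE[b]) * weight)
    return sorted(out)

neighbours = neighbour_indices("ACGT")
```

A word of length k has 3k such neighbours, so a lookup table keyed by index can be probed for near matches in time proportional to k.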

All methods described have thus been shown to be functional, and it is clear that these programs will facilitate genome sequencing and assembly.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2002. 55 p.
Series
Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine, ISSN 0282-7476 ; 1169
Keyword [en]
Genetics, Shotgun sequencing, multiple alignment, fragment assembly, repeats, sequencing error
Keyword [sv]
Genetik
National Category
Medical Genetics
Research subject
Medical Genetics
Identifiers
URN: urn:nbn:se:uu:diva-2176
ISBN: 91-554-5361-9 (print)
OAI: oai:DiVA.org:uu-2176
DiVA: diva2:161818
Public defence
2002-09-06, Rudbecksalen, Uppsala, 13:15
Opponent
Available from: 2002-05-23. Created: 2002-05-23. Bibliographically approved.
List of papers
1. TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences
2003 (English). In: Computer Methods and Programs in Biomedicine, ISSN 0169-2607, E-ISSN 1872-7565, Vol. 70, no 1, p. 47-59. Article in journal (Refereed). Published.
Abstract [en]

The software commonly used for assembly of shotgun sequence data has several limitations. One such limitation becomes obvious when repetitive sequences are encountered. Shotgun assembly is a difficult task, even for non-repetitive regions, but the use of quality assessments of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. None of the currently available fragment assembly programs are able to correctly assemble highly similar repetitive data, and we, therefore, present a novel shotgun assembly program, Tandem Repeat Assembly Program (TRAP). The main feature of this program is the ability to separate long repetitive regions from each other by distinguishing single base substitutions as well as insertions/deletions from sequencing errors. This is accomplished by using a novel multiple-alignment based analysis method. Since repeats are a common complication in most sequencing projects, this software should be of use for the whole sequencing community.

National Category
Medical and Health Sciences
Identifiers
URN: urn:nbn:se:uu:diva-89932
DOI: 10.1016/S0169-2607(01)00194-8
PubMed ID: 12468126
Available from: 2002-05-23. Created: 2002-05-23. Last updated: 2013-07-24. Bibliographically approved.
2. Separation of Nearly Identical Repeats in Shotgun Assemblies using Defined Nucleotide Positions, DNPs
2002 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1460-2059, Vol. 18, no 3, p. 379-388. Article in journal (Refereed). Published.
Abstract [en]

An increasingly important problem in genome sequencing is the failure of the commonly used shotgun assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions or regions containing repeats considerably shorter than the average read length is in practice easy to solve, while longer repeats have been a difficult problem. We here present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1% and the sequencing error may be up to ten times higher. The method is based on the realization that a comparison of only a part of all overlapping sequences at a time in a data set does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions, DNPs, which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, where the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can solve repeats that differ in only two positions in a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.
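To make the statistical argument above more concrete, here is a minimal sketch under a deliberately simple model: errors are independent across reads and columns, and a substitution error is equally likely to produce any of the three other bases. It computes how likely it is that a group of reads shares the same deviation purely through sequencing error, first in one column and then in two columns at once. This is not the test used in the paper, and the function names and the model are assumptions for illustration only.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def prob_chance_support(n_reads, k, error_rate):
    """Probability that at least k of n_reads show one particular wrong
    base in a single column through error alone (assumed model)."""
    p = error_rate / 3.0  # assumption: the 3 wrong bases are equally likely
    return binomial_tail(n_reads, k, p)

def prob_chance_pair(n_reads, k, error_rate):
    """Probability that at least k of n_reads show a particular wrong base
    in BOTH of two columns through error alone (assumed independence)."""
    p = error_rate / 3.0
    return binomial_tail(n_reads, k, p * p)

# Hypothetical setting: 20 overlapping reads, 10% sequencing error,
# 3 reads sharing the same deviation.
single = prob_chance_support(20, 3, 0.10)
paired = prob_chance_pair(20, 3, 0.10)
```

Under this model the paired probability is far smaller than the single-column one. That is the intuition behind requiring combinations of candidate DNPs: coincident deviations in the same reads at two positions are very unlikely to be produced by random errors.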

National Category
Medical and Health Sciences
Identifiers
URN: urn:nbn:se:uu:diva-89933
DOI: 10.1093/bioinformatics/18.3.379
PubMed ID: 11934736
Available from: 2002-05-23. Created: 2002-05-23. Last updated: 2013-06-12. Bibliographically approved.
3. Correcting Errors in Shotgun Sequences Using DNPs and a Novel Pattern Matching Algorithm
In: Bioinformatics. Article in journal (Refereed). Submitted.
Identifiers
URN: urn:nbn:se:uu:diva-89934
Available from: 2002-05-23. Created: 2002-05-23. Bibliographically approved.
