Uppsala University Publications
Separation of Nearly Identical Repeats in Shotgun Assemblies using Defined Nucleotide Positions, DNPs
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Genetics and Pathology. (Rudbeck Laboratory)
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Genetics and Pathology. (Rudbeck Laboratory)
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Mathematics.
Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Genetics and Pathology. (Rudbeck Laboratory)
2002 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 18, no 3, 379-388 p. Article in journal (Refereed). Published
Abstract [en]

An increasingly important problem in genome sequencing is the failure of the commonly used shotgun assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions or regions containing repeats considerably shorter than the average read length is in practice easy to solve, while longer repeats have been a difficult problem. We here present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1% and the sequencing error may be up to ten times higher. The method is based on the realization that a comparison of only a part of all overlapping sequences at a time in a data set does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions, DNPs, which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, where the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can solve repeats that differ in only two positions in a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.
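
The idea of looking at combinations of candidate DNPs can be illustrated with a minimal sketch. It is not the published method: it skips the statistical test and only finds column pairs in which the same reads deviate from the consensus. The function names, the `min_shared` threshold, the use of `-` for positions a read does not cover, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def deviating_reads(alignment, col):
    """Return (consensus_base, {minority_base: set of read indices}) for one column.

    '-' marks a position the read does not cover (an assumption of this sketch).
    """
    column = [read[col] for read in alignment]
    counts = Counter(base for base in column if base != "-")
    if not counts:
        return None, {}
    consensus = counts.most_common(1)[0][0]
    groups = {}
    for index, base in enumerate(column):
        if base != "-" and base != consensus:
            groups.setdefault(base, set()).add(index)
    return consensus, groups


def candidate_dnp_pairs(alignment, min_shared=2):
    """List column pairs where at least `min_shared` reads deviate in both columns."""
    width = len(alignment[0])
    deviants = {}
    for col in range(width):
        _, groups = deviating_reads(alignment, col)
        reads = set().union(*groups.values()) if groups else set()
        if len(reads) >= min_shared:
            deviants[col] = reads
    pairs = []
    for a, b in combinations(sorted(deviants), 2):
        shared = deviants[a] & deviants[b]
        if len(shared) >= min_shared:
            pairs.append((a, b, sorted(shared)))
    return pairs


# Toy multi-alignment: reads 3 and 4 come from a second repeat copy that
# differs at columns 2 and 7; read 5 carries one isolated sequencing error.
example_reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACCTACGAAC",
    "ACCTACGAAC",
    "ACGTTCGTAC",
]
print(candidate_dnp_pairs(example_reads))  # [(2, 7, [3, 4])]
```

The isolated error in read 5 is discarded because no other read shares it, while the two columns where reads 3 and 4 deviate together are reported as a candidate pair. A real analysis would replace the fixed threshold with a probability computed from the multi-alignment, as the abstract describes.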

Place, publisher, year, edition, pages
2002. Vol. 18, no 3, 379-388 p.
National Category
Medical and Health Sciences
Identifiers
URN: urn:nbn:se:uu:diva-89933
DOI: 10.1093/bioinformatics/18.3.379
PubMedID: 11934736
OAI: oai:DiVA.org:uu-89933
DiVA: diva2:161816
Available from: 2002-05-23 Created: 2002-05-23 Last updated: 2017-12-14. Bibliographically approved
In thesis
1. Software Tools and Algorithms for Shotgun Sequence Assembly
2002 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

During the last ten years, a genomics revolution has changed the ways biological research is carried out. The draft sequence of the human genome is available, as well as the sequence of 84 other completed genomes. High-throughput genomics technologies such as genome sequencing with associated bioinformatics tools have been instrumental in this process. The draft genome sequences were determined using the shotgun sequencing strategy, where long DNA molecules are randomly sheared into small pieces from which sequences are determined. These are assembled by computer programs in order to reconstruct the original genome sequence. Ubiquitous repeated sequences together with errors in the sequencing process complicate the assembly of shotgun fragments. In most genome projects gaps are caused by this complication.

This thesis presents methods and algorithms to separate repeated sequences in shotgun projects. The Tandem Repeat Assembly Program (TRAP) builds multiple alignments of reads, which are then analyzed in order to discriminate sequencing errors from real differences between highly similar repeats. The method is based on the fact that sequencing errors are randomly distributed, as opposed to the systematic distribution of mutations in repeat copies. The TRAP assembler was shown to be able to correctly assemble 2000 bp repeat copies that are repeated in tandem eight times. The degree of difference between repeat copies was 1.0%, and the maximum sequencing error 11%.
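
The contrast between randomly distributed errors and systematic differences can be made concrete with a small calculation. The sketch below is an illustration, not the statistic used by TRAP. It assumes that errors are independent between reads and that an error substitutes any of the three other bases with equal probability. The function name and parameters are hypothetical.

```python
from math import comb


def prob_chance_column(n_reads, k_deviating, error_rate):
    """Probability that at least `k_deviating` of `n_reads` reads show one
    particular wrong base at a given column through sequencing error alone.

    Assumptions of this sketch: independent errors, uniform substitution
    among the three other bases.
    """
    p = error_rate / 3.0  # chance that one read shows this specific wrong base
    return sum(
        comb(n_reads, j) * p**j * (1 - p) ** (n_reads - j)
        for j in range(k_deviating, n_reads + 1)
    )


# With 20 overlapping reads and an 11% error rate, the probability falls
# quickly as more reads are required to agree on the same deviation.
for k in (1, 2, 3, 4, 5):
    print(k, prob_chance_column(20, k, 0.11))
```

Under these assumptions, several reads agreeing on the same deviating base at the same column is much less likely to be noise than a single deviating read, which is the intuition behind separating real repeat differences from errors.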

A refined method based on single base differences between repeat copies has been developed to further improve repeat separation. Results show that in the same sequence, 87% of all the single base differences present in the repeats can be detected, with an error of only 1.6%.

In addition, a novel pattern-matching algorithm was developed. This algorithm takes advantage of the inherent symmetry between indices that can be computed for similar words of the same length and was implemented in the error correction software, MisEd. The results show that up to 99.3% of the sequencing errors can be corrected, while up to 87% of the single base differences remain.

All methods described have thus been shown to be functional, and it is clear that these programs will facilitate genome sequencing and assembly.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2002. 55 p.
Series
Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine, ISSN 0282-7476 ; 1169
Keyword
Genetics, Shotgun sequencing, multiple alignment, fragment assembly, repeats, sequencing error, Genetik
National Category
Medical Genetics
Research subject
Medical Genetics
Identifiers
URN: urn:nbn:se:uu:diva-2176
ISBN: 91-554-5361-9
Public defence
2002-09-06, Rudbecksalen, Uppsala, 13:15
Opponent
Available from: 2002-05-23 Created: 2002-05-23. Bibliographically approved
