New Algorithms for EST clustering

Ptitsyn, Andrey

dc.contributor.advisor	Hide, Winston
dc.contributor.author	Ptitsyn, Andrey
dc.date.accessioned	2022-03-09T07:09:34Z
dc.date.available	2022-03-09T07:09:34Z
dc.date.issued	2000
dc.identifier.uri	http://hdl.handle.net/11394/8845
dc.description	Philosophiae Doctor - PhD	en_US
dc.description.abstract	Expressed sequence tag database is a rich and fast growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modem computational biology. There are a few systems, designed for this purpose and a few more are currently under development. These systems are reviewed in the "Literature and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as based on alignment or oligonucleotide compositions. Analysis of potential bottlenecks and estimation of computation complexity of EST clustering is done in Chapter 2. This chapter also states the goals for the research and justifies the need for new algorithm that has to be fast, but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has a linear computation complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm utilizes an asymmetric approach, when one of the compared sequences is presented in a form of oligonucleotide table, while the second sequence is in standard, linear form. A short window is moved along the linear sequence and all overlapping oligonucleotides of a constant length in the frame are compared for the oligonucleotide table. The result of comparison of two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity a probability of false positive and false negative can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40bp with 95% identity, which is better than resolution ability of contemporary EST clustering tools This algorithm was used as a sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies: stringent and loose clustering. Both are tested on small, but realistic benchmark data sets and show the results, similar to one of the best existing clustering programs, 02_cluster, but with a significant advantage in speed and sensitivity to small overlapping regions of ESTs. On three different CPUs the new algorithm run at least two times faster, leaving less singletons and producing bigger clusters. With parallel optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as 02_ cluster and can be immediately included into the STACKPack package for EST clustering. The stringent clustering program produces already assembled clusters and can apprehend alternatively processed variants during the clustering process.	en_US
dc.language.iso	en	en_US
dc.publisher	University of the Western Cape	en_US
dc.subject	Tentative Human Consensus (THC)	en_US
dc.subject	Expression Analysis Tool (VEAT)	en_US
dc.subject	Expressed Gene Anatomy Database (EGAD)	en_US
dc.subject	Receiver Operator Characteristic (ROC)	en_US
dc.subject	Nucleotide	en_US
dc.subject	Oligonucleotides	en_US
dc.title	New Algorithms for EST clustering	en_US
dc.rights.holder	University of the Western Cape	en_US

Files in this item

Name:: Ptitsyn_PhD_NSC_2000.pdf
Size:: 47.36Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Philosophiae Doctor - PhD (Microbiology)

Show simple item record