Show simple item record

dc.contributor.advisorHide, Winston
dc.contributor.authorPtitsyn, Andrey
dc.date.accessioned2022-03-09T07:09:34Z
dc.date.available2022-03-09T07:09:34Z
dc.date.issued2000
dc.identifier.urihttp://hdl.handle.net/11394/8845
dc.descriptionPhilosophiae Doctor - PhDen_US
dc.description.abstractExpressed sequence tag database is a rich and fast growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modem computational biology. There are a few systems, designed for this purpose and a few more are currently under development. These systems are reviewed in the "Literature and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as based on alignment or oligonucleotide compositions. Analysis of potential bottlenecks and estimation of computation complexity of EST clustering is done in Chapter 2. This chapter also states the goals for the research and justifies the need for new algorithm that has to be fast, but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has a linear computation complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm utilizes an asymmetric approach, when one of the compared sequences is presented in a form of oligonucleotide table, while the second sequence is in standard, linear form. A short window is moved along the linear sequence and all overlapping oligonucleotides of a constant length in the frame are compared for the oligonucleotide table. The result of comparison of two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity a probability of false positive and false negative can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40bp with 95% identity, which is better than resolution ability of contemporary EST clustering tools This algorithm was used as a sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies: stringent and loose clustering. Both are tested on small, but realistic benchmark data sets and show the results, similar to one of the best existing clustering programs, 02_cluster, but with a significant advantage in speed and sensitivity to small overlapping regions of ESTs. On three different CPUs the new algorithm run at least two times faster, leaving less singletons and producing bigger clusters. With parallel optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as 02_ cluster and can be immediately included into the STACKPack package for EST clustering. The stringent clustering program produces already assembled clusters and can apprehend alternatively processed variants during the clustering process.en_US
dc.language.isoenen_US
dc.publisherUniversity of the Western Capeen_US
dc.subjectTentative Human Consensus (THC)en_US
dc.subjectExpression Analysis Tool (VEAT)en_US
dc.subjectExpressed Gene Anatomy Database (EGAD)en_US
dc.subjectReceiver Operator Characteristic (ROC)en_US
dc.subjectNucleotideen_US
dc.subjectOligonucleotidesen_US
dc.titleNew Algorithms for EST clusteringen_US
dc.rights.holderUniversity of the Western Capeen_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record