New Algorithms for EST clustering
Abstract
Expressed sequence tag database is a rich and fast growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modem computational biology. There are a few systems, designed for this purpose and a few more are currently under development. These systems are reviewed in the "Literature
and software review". Different strategies of supervised and unsupervised clustering are discussed, as well as sequence comparison techniques, such as based on alignment or oligonucleotide compositions. Analysis of potential bottlenecks and estimation of computation complexity of EST clustering is done in Chapter 2. This chapter also states the goals for the research and justifies the need for new algorithm that has to be fast, but still sensitive to relatively short (40 bp) regions of local similarity. A new sequence comparison algorithm is developed and described in Chapter 3. This algorithm has a linear computation complexity and sufficient sensitivity to detect short regions of local similarity between nucleotide sequences. The algorithm utilizes an asymmetric approach, when one of the compared sequences is presented in a form of oligonucleotide table, while the second sequence is in standard, linear form. A short window is moved along the linear sequence and all overlapping oligonucleotides of a constant length in the frame are compared for the oligonucleotide table. The result of comparison of two sequences is a single figure, which can be compared to a threshold. For each measure of sequence similarity a probability of false positive and false negative can be estimated. The algorithm was set up and implemented to recognize matching ESTs with overlapping regions of 40bp with 95% identity, which is better than resolution ability of contemporary EST clustering tools This algorithm was used as a sequence comparison engine for two EST clustering programs, described in Chapter 4. These programs implement two different strategies:
stringent and loose clustering. Both are tested on small, but realistic benchmark data sets and show the results, similar to one of the best existing clustering programs, 02_cluster, but with a significant advantage in speed and sensitivity to small overlapping regions of ESTs. On three different CPUs the new algorithm run at least two times faster, leaving less singletons and producing bigger clusters. With parallel
optimization this algorithm is capable of clustering millions of ESTs on relatively inexpensive computers. The loose clustering variant is a highly portable application, relying on third-party software for cluster assembly. It was built to the same specifications as 02_ cluster and can be immediately included into the STACKPack package for EST clustering. The stringent clustering program produces already
assembled clusters and can apprehend alternatively processed variants during the clustering process.