Generation of a human gene index and its application to disease candidacy

Christoffels, Alan

dc.contributor.advisor	Hide, Winston
dc.contributor.author	Christoffels, Alan
dc.contributor.other	Faculty of Science
dc.date.accessioned	2013-08-27T13:21:39Z
dc.date.available	2007/07/26 10:00
dc.date.available	2007/07/26
dc.date.available	2013-08-27T13:21:39Z
dc.date.issued	2001
dc.identifier.uri	http://hdl.handle.net/11394/2002
dc.description	Philosophiae Doctor - PhD	en_US
dc.description.abstract	With easy access to technology to generate expressed sequence tags (ESTs), several groups have sequenced from thousands to several thousands of ESTs. These ESTs benefit from consolidation and organization to deliver significant biological value. A number of EST projects are underway to extract maximum value from fragmented EST resources by constructing gene indices, where all transcripts are partitioned into index classes such that transcripts are put into the same index class if they represent the same gene. Therefore a gene index should ideally represent a non-redundant set of transcripts. Indeed, most gene indices aim to reconstruct the gene complement of a genome and their technological developments are directed at achieving this goal. The South African National Bioinformatics Institute (SANBI), on the other hand, embarked on the development of the sequence alignment and consensus knowledgebase (STACK) database that focused on the detection and visualisation of transcript variation in the context of developmental and pathological states, using all publicly available ESTs. Preliminary work on the STACK project employed an approach of partitioning the EST data into arbitrarily chosen tissue categories as a means of reducing the EST sequences to manageable sizes for subsequent processing. The tissue partitioning provided the template material for developing error-checking tools to analyse the information embedded in the error-laden EST sequences. However, tissue partitioning increases redundancy in the sequence data because one gene can be expressed in multiple tissues, with the result that multiple tissue-partitioned transcripts will correspond to the same gene. Therefore, the sequence data represented by each tissue category had to be merged in order to obtain a comprehensive view of expressed transcript variation across all available tissues.
The need to consolidate all EST information provided the impetus for developing a STACK human gene index, also referred to as a whole-body index. In this dissertation, I report on the development of a STACK human gene index represented by consensus transcripts where all constituent ESTs sample single or multiple tissues in order to provide the correct developmental and pathological context for investigating sequence variation. Furthermore, the availability of a human gene index is assessed as a disease-candidate gene discovery resource. A feasible approach to construction of a whole-body index required the ability to process error-prone EST data in excess of one million sequences (1,198,607 ESTs as of December 1998). In the absence of new clustering algorithms at that time, we successfully ported D2_CLUSTER, an EST clustering algorithm, to the high performance shared multiprocessor machine, Origin2000. Improvements to the parallelised version of D2_CLUSTER included: (i) the ability to cluster sequences on as many as 126 processors (for example, 462,000 ESTs were clustered in 31 hours on 126 R10000 MHz processors, Origin2000); (ii) enhanced memory management that allowed for clustering of mRNA sequences as long as 83,000 base pairs; (iii) the ability to have the input sequence data accessible to all processors, allowing rapid access to the sequences; and (iv) a restart module that allowed a job to be restarted if it was interrupted. The successful enhancements to the parallelised version of D2_CLUSTER, as listed above, allowed for the processing of EST datasets in excess of 1 million sequences. A hierarchical approach was adopted where 1,198,607 ESTs from GenBank release 110 (October 1998) were partitioned into "tissue bins" and each tissue bin was processed through a pipeline that included masking for contaminants, clustering, assembly, assembly analysis and consensus generation.
A total of 478,707 consensus transcripts were generated for all the tissue categories and these sequences served as the input data for the generation of the whole-body index sequences. The clustering of all tissue-derived consensus transcripts was followed by the collapse of each consensus sequence to its individual ESTs prior to assembly and whole-body index consensus sequence generation. The hierarchical approach demonstrated a consolidation of the input EST data from 1,198,607 ESTs to 69,158 multi-sequence clusters and 162,439 singletons (or individual ESTs). Chromosomal locations were added to 25,793 whole-body index sequences through assignment of genetic markers such as radiation hybrid markers and Généthon markers. The whole-body index sequences were made available to the research community through a sequence-based search engine (http://ziggy.sanbi.ac.za/~alan/researchINDEX.html).	en_US
dc.language.iso	en	en_US
dc.publisher	University of the Western Cape	en_US
dc.subject	Human genetics	en_US
dc.subject	Human genome	en_US
dc.subject	Human gene mapping	en_US
dc.subject	Human chromosomes	en_US
dc.subject	Human gene libraries	en_US
dc.subject	Gene libraries	en_US
dc.title	Generation of a human gene index and its application to disease candidacy	en_US
dc.type	Thesis	en_US
dc.rights.holder	University of the Western Cape	en_US
dc.description.country	South Africa

Files in this item

Name: Christoffels_PHD_2001.pdf
Size: 1.268Mb
Format: PDF

This item appears in the following Collection(s)

Philosophiae Doctor - PhD (Bioinformatics)
