In silico investigation of Glossina morsitans promoters
Tsetse flies (Glossina spp.) are the biological vectors of trypanosomes, the causative agents of Human African Trypanosomiasis (HAT). HAT is a debilitating disease that remains a major public health problem and a key factor limiting rural development in vast regions of tropical Africa. To augment vector control efforts, the International Glossina Genome Initiative (IGGI) was established in 2004 with the ultimate goal of generating a fully annotated whole genome sequence for Glossina morsitans. A working draft genome of Glossina morsitans was made available in 2011. In this thesis, transcriptional regulatory features in Glossina morsitans were analysed using the draft genome. A method for identifying transcription start sites (TSS) in the newly sequenced genome was developed using TSS-seq tags sampled from two developmental stages, larvae and pupae. High-throughput next-generation sequencing reads obtained from these stages were used to locate TSS in the Glossina morsitans genome. TSS-seq tag clusters, defined as a minimum number of reads at the predicted 5’ UTR or first coding exon, were used to define transcription start sites. A total of 3134 tag clusters were identified in the Glossina morsitans genome. Approximately 45.4% (1424) of the tag clusters mapped to first coding exons or their proximal predicted 5’ UTR regions, including 31 tag clusters that mapped to transposons. A total of 1101 (35.1%) tag clusters mapped outside genic regions and/or to scaffolds without gene predictions and may correspond to previously unannotated transcripts or noncoding RNA TSS. Core promoter regions were classified as narrow or broad based on the number of TSS positions within a TSS-seq cluster. The majority (95%) of the core promoters analysed in this study were of the broad type, while only 5% were of the narrow type.
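The narrow/broad classification described above can be illustrated with a minimal Python sketch. It is not the thesis pipeline: the function names and the cutoff `max_narrow_positions` are hypothetical, since the abstract does not state the threshold used, and no attempt is made to subdivide broad promoters further.

```python
from collections import Counter


def classify_cluster(tag_positions, max_narrow_positions=3):
    """Label one TSS-seq tag cluster as 'narrow' or 'broad'.

    tag_positions: iterable of genomic coordinates, one per mapped tag 5' end.
    max_narrow_positions: hypothetical cutoff (an assumption, not a value
    taken from the thesis) on the number of distinct TSS positions.
    """
    distinct_positions = len(set(tag_positions))
    if distinct_positions == 0:
        raise ValueError("cluster contains no tags")
    if distinct_positions <= max_narrow_positions:
        return "narrow"
    return "broad"


def summarise(clusters, max_narrow_positions=3):
    """Return the fraction of clusters falling in each class."""
    counts = Counter(
        classify_cluster(cluster, max_narrow_positions) for cluster in clusters
    )
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("narrow", "broad")}
```

For example, `classify_cluster([100, 100, 101])` has two distinct positions and is labelled narrow under the assumed cutoff, whereas a cluster spread over four or more positions is labelled broad.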
Comparison of canonical core promoter motif occurrences between random and bona fide core promoters showed that, in general, biologically functional genomic windows in the true dataset contained more motifs than windows in the random dataset (p <= 0.00164, 0.00135 and 0.00185 for the narrow, broad with peak and broad without peak categories, respectively). The frequency of motif co-occurrence in core promoters differed fundamentally across initiation patterns. Narrow core promoters had a higher frequency of the TATA-box and INR motifs, and two-way motif co-occurrence analysis showed that the TATA-box-INR pair is over-represented in the narrow category. Broad core promoters had a higher frequency of the BREd and MTE motifs, and two-way motif co-occurrence analysis showed that the MTE-DPE pair is over-represented in broad core promoters. TATA-less promoters accounted for 77% of the core promoters in this analysis. TATA-less core promoters showed a higher frequency of the MTE and INR motifs, in contrast to observations in Drosophila, where the DPE motif has been reported to occur frequently in TATA-less promoters. The over-representation of these motif combinations suggests that they are equally important for transcription in their respective promoter classes in Glossina morsitans.
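Two-way motif co-occurrence of the kind reported above can be tallied as in the following Python sketch. It assumes that motif scanning has already produced, for each core promoter, the set of motifs found in it; the function name and input layout are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(promoter_motifs):
    """Fraction of promoters in which each unordered motif pair co-occurs.

    promoter_motifs: list with one collection of motif names per promoter,
    e.g. [{"TATA", "INR"}, {"INR"}] (hypothetical input layout).
    """
    n_promoters = len(promoter_motifs)
    if n_promoters == 0:
        return {}
    counts = Counter()
    for motifs in promoter_motifs:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(motifs)), 2):
            counts[pair] += 1
    return {pair: count / n_promoters for pair, count in counts.items()}
```

Computing these frequencies separately for each promoter class, and again for a matched random dataset, would give the quantities needed to judge whether a pair such as TATA-box-INR is over-represented; the statistical test itself is outside the scope of this sketch.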