Genome assembly of next-generation sequencing data for the Oryx bacillus: species of the Mycobacterium tuberculosis complex
Next-generation sequencing (NGS) technology platforms have accelerated the ability to produce completed genome assemblies. Recently, collaborators at Tygerberg Medical School outsourced the sequencing of Oryx bacillus, a member of the Mycobacterium tuberculosis complex (MTC). A total of 31,271,059 short reads were generated and required filtering, assembly and annotation using bioinformatics algorithms. In this project, an NGS assembly pipeline was implemented, tailored specifically for SOLiD sequence data. The raw reads were aligned using NovoalignCS to seven fully sequenced and annotated MTC members, namely Mycobacterium tuberculosis H37Rv, H37Ra, CDC1551, F11 and KZN 1435, Mycobacterium bovis AF2122/97 and Mycobacterium bovis BCG str. Pasteur 1173P2. Depth and breadth of sequence coverage across each base of the reference genomes were calculated using BEDTools. Structural variation at the nucleotide level, including deletions, insertions and single nucleotide polymorphisms (SNPs), was called using three tools: GATK, SAMtools and Nesoni. These variations were further filtered using in-house Perl scripts. Putative functional roles for the alterations at the DNA level were extrapolated from their overlap with essential genes present in annotated MTC members. Approximately 20,730,631 short reads (59.78%) out of a total of 31,271,059 reads aligned to the seven reference genomes. The per-base sequence coverage calculations revealed an average of 1,243 unaligned regions. These unaligned regions overlapped with mycobacterial regions of difference (RD) and with phage genetic elements that were acquired by the MTC through horizontal gene transfer and that correspond to genes prevalent in clinical isolates of M. tuberculosis. A total of 2,680 genetic variations were identified and categorised into 845 synonymous and 1,724 non-synonymous SNPs, together with 44 insertions and 67 deletions. Some of the variant alleles overlapped genes known to be involved in TB drug resistance.
While the biological significance of our findings remains to be elucidated, they nonetheless deserve further attention, because SNPs have the potential to affect strain phenotype through gene disruption. Hypotheses generated from these large-scale analyses will therefore be tested by our collaborators at Tygerberg Medical School.
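
As an illustration of the per-base coverage step, the following minimal Python sketch shows one way breadth of coverage, mean depth and unaligned (zero-depth) regions could be derived from a list of per-base depths, such as the depth column reported by a per-position coverage tool. It is not the pipeline used in this project: the function names `coverage_summary` and `unaligned_regions` and the `min_len` parameter are hypothetical and introduced only for this example.

```python
from itertools import groupby


def coverage_summary(depths):
    """Return (breadth, mean_depth) for one reference sequence.

    depths: list of per-base read depths, one entry per reference position.
    Breadth is the fraction of positions covered by at least one read.
    """
    n = len(depths)
    if n == 0:
        return 0.0, 0.0
    covered = sum(1 for d in depths if d > 0)
    return covered / n, sum(depths) / n


def unaligned_regions(depths, min_len=1):
    """Return 1-based inclusive (start, end) intervals with zero depth.

    min_len is a hypothetical threshold: shorter zero-depth runs are ignored.
    """
    regions = []
    pos = 1  # 1-based coordinate of the first base in the current run
    for is_zero, run in groupby(depths, key=lambda d: d == 0):
        length = sum(1 for _ in run)
        if is_zero and length >= min_len:
            regions.append((pos, pos + length - 1))
        pos += length
    return regions
```

Calling `unaligned_regions(depths, min_len=2)` on a toy depth list would keep only zero-depth runs of at least two bases. The resulting intervals could then be intersected with annotated features, such as regions of difference, in a separate step.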