Computational prediction of host-pathogen protein-protein interactions
Supervised machine learning approaches have been applied successfully to the prediction of protein-protein interactions (PPIs) within a single organism, i.e., intra-species prediction. However, because large sets of experimentally validated PPI data for training and testing are lacking, fewer studies have successfully applied these techniques to host-pathogen PPIs, i.e., inter-species prediction. Most host-pathogen studies have focused on human-virus interactions, and specifically on human-HIV PPI data. Improved machine learning techniques and feature sets are needed to raise the classification accuracy of host-pathogen PPI prediction. The primary aim of this bioinformatics thesis was to develop a binary classifier with an appropriate feature set for host-pathogen PPI prediction using published human-Hepatitis C virus PPI data, and to test the model on available host-pathogen data for human-Bacillus anthracis PPIs. Twelve different feature sets were compared to find the optimal set. The feature selection process revealed that our novel quadruple feature (a subsequence of four consecutive amino acids), combined with sequence similarity and human interactome network properties (such as degree, clustering coefficient, and betweenness centrality), was the best set. The optimal feature set outperformed those in the relevant published material, giving 95.9% sensitivity, 91.6% specificity and 89.0% accuracy. Using our optimal feature set, we developed a neural network model to predict PPIs between human and Mycobacterium tuberculosis. The strategy was to develop a model trained with intra-species PPI data and extend it to inter-species prediction. However, the lack of experimentally validated PPI data between human and Mycobacterium tuberculosis (M. tuberculosis) led us to first assess the feasibility of using validated intra-species PPI data to build a model for inter-species PPI prediction.
In this feasibility model, we combined human intra-species PPI data with Bacillus anthracis intra-species PPI data to develop a binary classification model, and extended the model to human-Bacillus anthracis inter-species prediction. We tested our hypothesis on known human-Bacillus anthracis PPI data, and the results showed good performance, with an average accuracy of 89.0%. The same approach was then extended to the prediction of PPIs between human and Mycobacterium tuberculosis. The predicted human-M. tuberculosis PPIs were further validated using functional enrichment of experimentally verified secretory proteins in M. tuberculosis, cellular compartment analysis and pathway enrichment analysis. The results show that five M. tuberculosis secretory proteins found within an infected host macrophage, and corresponding to the virulent mycobacterial strain H37Rv, were recovered from the human-M. tuberculosis PPI dataset predicted by our model. Finally, a web server was created to predict PPIs between human and Mycobacterium tuberculosis; it is available online at http://hppredict.sanbi.ac.za. In summary, the concepts, techniques and technologies developed as part of this thesis have the potential not only to contribute to the understanding of PPIs between human and Mycobacterium tuberculosis, but also to be extended to other pathogens. Further materials related to this study are available at ftp://ftp.sanbi.ac.za/machine learning.
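As an illustration of the quadruple feature described above (a subsequence of four consecutive amino acids), the following minimal Python sketch counts overlapping four-residue windows in a protein sequence and normalises them to frequencies. It is a sketch under stated assumptions, not the implementation used in the thesis: the function names, the skipping of non-standard residue symbols, the frequency normalisation and the `host:`/`pathogen:` key prefixes are all hypothetical choices made here for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def quadruple_counts(sequence, k=4):
    """Count every overlapping window of k consecutive standard amino acids."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous symbols (e.g. X, B, Z) are skipped.
        if all(residue in AMINO_ACIDS for residue in window):
            counts[window] += 1
    return counts


def quadruple_frequencies(sequence, k=4):
    """Normalise quadruple counts so that they sum to 1 (assumed scaling)."""
    counts = quadruple_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {quad: n / total for quad, n in counts.items()}


def pair_features(host_seq, pathogen_seq):
    """Hypothetical pairing: one sparse feature dict per host-pathogen pair."""
    features = {}
    for quad, freq in quadruple_frequencies(host_seq).items():
        features["host:" + quad] = freq
    for quad, freq in quadruple_frequencies(pathogen_seq).items():
        features["pathogen:" + quad] = freq
    return features
```

A sparse dictionary is used because there are 20^4 = 160,000 possible quadruples, most of which are absent from any single protein. In the approach summarised above, such sequence features are combined with sequence similarity and network properties before classification; those parts are not shown here.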