Data Science techniques for predicting plant genes involved in secondary metabolites production

Muteba, Ben Ilunga

dc.contributor.advisor	Christoffels, Alan
dc.contributor.author	Muteba, Ben Ilunga
dc.date.accessioned	2019-10-01T10:40:41Z
dc.date.available	2019-10-01T10:40:41Z
dc.date.issued	2018
dc.identifier.uri	http://hdl.handle.net/11394/7039
dc.description	Master of Science	en_US
dc.description.abstract	Plant genome analysis is currently experiencing a boost because next-generation sequencing technologies have reduced sequencing costs. Knowledge of a plant's genetic background can be applied to guide targeted plant selection and breeding, and to facilitate natural product discovery and biological engineering. In medicinal plants, secondary metabolites are of particular interest because they often represent the main active ingredients associated with health-promoting qualities. Plant polyphenols are a highly diverse family of aromatic secondary metabolites that act as antimicrobial agents, UV protectants, and insect or herbivore repellents. Most genome mining tools developed to analyse genetic material have seldom addressed secondary metabolite genes and biosynthesis pathways. Little research has studied the key enzyme factors that can predict a class of secondary metabolite genes from polyketide synthases. The objectives of this study were twofold. First, it aimed to identify the biological properties of secondary metabolite genes and to select a specific gene, naringenin-chalcone synthase, also known as chalcone synthase (CHS). The study hypothesized that data science approaches to mining biological data, particularly secondary metabolite genes, would make it possible to uncover some aspects of secondary metabolites (SM). Second, it aimed to propose a proof of concept for classifying or predicting plant genes involved in polyphenol biosynthesis using data science techniques, and to apply these techniques in computational analysis through machine learning algorithms and mathematical and statistical approaches.
Three specific challenges arose while analysing secondary metabolite datasets: 1) class imbalance, meaning that the protein sequence classes are not represented in proportion; 2) high dimensionality, meaning the very large feature space that arises when analysing bioinformatics datasets; and 3) variation in protein sequence length, meaning that the sequences to be compared differ in length. Given these inherent issues, developing precise classification and statistical models is challenging. Effective SM plant gene mining therefore requires dedicated data science techniques that can collect, prepare and analyse SM genes.	en_US
dc.language.iso	en	en_US
dc.publisher	University of the Western Cape	en_US
dc.subject	Medicinal plants	en_US
dc.subject	Polyphenols	en_US
dc.subject	Feature selection	en_US
dc.subject	Data visualisation	en_US
dc.subject	Feature engineering	en_US
dc.title	Data Science techniques for predicting plant genes involved in secondary metabolites production	en_US
dc.rights.holder	University of the Western Cape	en_US
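
The three dataset challenges named in the abstract (class imbalance, high dimensionality, and differing protein sequence lengths) can be illustrated with a minimal Python sketch. This is not the method used in the thesis, which this record does not describe. It assumes amino acid composition as a simple fixed-length encoding and inverse-frequency class weights as one common response to imbalance. The function names, the toy sequences, and the "CHS"/"other" labels are hypothetical and exist only for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector lines up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Map a protein sequence of any length to 20 relative frequencies.

    Every sequence yields a vector of the same size, which sidesteps the
    length differences and keeps the feature space small (20 dimensions).
    Characters outside the 20 standard residues are ignored.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so a classifier that accepts
    per-class weights is penalised more for misclassifying them.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy sequences of different lengths (not real CHS data).
    sequences = ["MVTVEEVRKA", "MASLE", "MGSIDKLLNAQRAEGPATILAIG", "MKT"]
    labels = ["CHS", "other", "other", "other"]

    features = [composition(s) for s in sequences]
    print({len(f) for f in features})       # one shared vector length
    print(balanced_class_weights(labels))   # the rare class gets more weight
```

The weighting formula follows the widely used "balanced" heuristic. Composition discards residue order, so richer encodings would likely be needed for a real classifier.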

Files in this item

Name: muteba_msc_nsc_2018.pdf
Size: 8.040Mb
Format: PDF

This item appears in the following Collection(s)

Magister Scientiae - MSc (Bioinformatics)