On the efficacy of enhanced feature selection methods for supervised crime prediction
Abstract
Rising crime across the globe has prompted a range of crime prevention measures. A variety of prevention strategies exist, such as equipping responders with the necessary weapons or tools. However, for resource-constrained nations such as South Africa, where the current police-to-civilian ratio is overwhelming, such strategies may not suffice. Consequently, crime continues to rise, necessitating alternative prevention strategies. One such alternative is to exploit historical crime data through machine learning. Crime prediction using machine learning has been explored and has shown promising results. However, the choice of algorithm and of feature selection method plays a critical role in creating an effective predictive model. This study therefore explores the efficacy of enhanced feature selection methods in supervised machine learning algorithms for crime prediction. Four (4) baseline algorithms are adopted: Random Forest (RF), Extremely Randomized Trees (ERT), Naïve Bayes (NB), and Support Vector Machine (SVM). The study further proposes three (3) algorithms: the first is derived by hybridizing RF and ERT (RF-Plus), while the other two (2) are obtained by enhancing NB and SVM with recursive feature elimination (RFE), yielding RFE-NB and RFE-SVM, respectively, for a total of seven (7) algorithms. Finally, the proposed algorithms are compared with their respective baselines and contrasted against two (2) additional algorithms from the literature, bringing the total to nine (9) algorithms. The models are evaluated on two distinct publicly available datasets, the Chicago and Los Angeles crime datasets. Results confirm that feature selection positively impacts prediction accuracy.
The enhancement improved the accuracy of plain NB from 72.5% to 96.6% on the Chicago dataset and from 80.45% to 95.78% on the Los Angeles dataset. It improved the accuracy of plain SVM from 74.73% to 89.91% and from 75.73% to 88.70% on the Chicago and Los Angeles datasets, respectively. RF-Plus achieved 97.04% and 95.5% on the Chicago and Los Angeles datasets, respectively.
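To make the idea of recursive feature elimination concrete, the following is a minimal, dependency-free sketch of the general procedure: fit a model, rank the features by importance, drop the weakest one, and repeat. It is not the study's implementation. The ranking model (a bias-free logistic regression trained by gradient descent), the function names, the hyperparameters, and the toy data are all assumptions chosen for illustration; the abstract does not specify how RFE-NB and RFE-SVM rank features.

```python
import math


def fit_logistic_weights(X, y, epochs=200, lr=0.1):
    """Fit a bias-free logistic regression by batch gradient descent.

    Returns the absolute weight of each feature, used here as a simple
    importance score (an assumption; other rankers could be swapped in).
    Weight magnitudes are only comparable if features share a similar scale.
    """
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        grads = [0.0] * n_features
        for row, label in zip(X, y):
            z = sum(w * v for w, v in zip(weights, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for j, v in enumerate(row):
                grads[j] += (p - label) * v
        weights = [w - lr * g / n_samples for w, g in zip(weights, grads)]
    return [abs(w) for w in weights]


def recursive_feature_elimination(X, y, n_keep, importance_fn=fit_logistic_weights):
    """Repeatedly refit on the surviving features and drop the least important.

    Returns the original column indices of the n_keep surviving features.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        importances = importance_fn(X_sub, y)
        weakest = min(range(len(remaining)), key=importances.__getitem__)
        del remaining[weakest]
    return remaining


# Hypothetical toy data: column 0 separates the classes strongly, column 1
# weakly, and column 2 alternates sign independently of the label (noise).
X_toy = [
    [2.0, 0.5, 1.0], [2.0, 0.5, -1.0], [2.0, 0.5, 1.0], [2.0, 0.5, -1.0],
    [-2.0, -0.5, 1.0], [-2.0, -0.5, -1.0], [-2.0, -0.5, 1.0], [-2.0, -0.5, -1.0],
]
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]

selected = recursive_feature_elimination(X_toy, y_toy, n_keep=2)
print(selected)  # prints [0, 1]: the noise column is eliminated first
```

In an enhanced classifier of the kind described above, the downstream model would then be trained only on the surviving columns. The ranking function is passed as a parameter so that a different importance measure can be substituted without changing the elimination loop.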