Hybrid Cluster-Based Sampling Technique For Class Imbalance Problems

ABSTRACT

The class imbalance problem is prevalent in many real-world domains and has become an area of increasing interest for researchers. In binary classification, imbalanced learning refers to learning from a dataset that is heavily skewed toward the negative class. This skew causes traditional classification algorithms to perform poorly when predicting the positive class on new examples. Data resampling is among the most commonly used techniques for dealing with this problem; it involves manipulating the training data before applying standard classification techniques. This study presents a new Hybrid Cluster-Based Sampling Technique (HCBST) capable of improving the overall performance of a wide range of traditional machine learning algorithms. The proposed method combines an undersampling technique based on CUST, which under-samples majority instances, with an oversampling technique derived from SNOCC, which oversamples minority instances. The method was implemented in Python 3.5 on Windows. Performance was evaluated using classification algorithms from the scikit-learn machine learning library, namely KNN, SVM, Decision Tree, Random Forest, Neural Network, AdaBoost, Naïve Bayes, and Quadratic Discriminant Analysis, on eleven datasets with varying degrees of imbalance. The performance of each classifier with the proposed technique was compared with its performance when no sampling was performed. In addition, the proposed method was compared with eight (8) other sampling techniques: ROS, RUS, SMOTE, ADASYN, CUST, SBC, CLUS, and OSS. The experimental results show that HCBST performed better with most of the classifiers in terms of AUC, G-Mean, and MCC. The overall average performance also shows that HCBST performed better on most of the datasets, achieving the highest average scores of 0.73, 0.67, and 0.35 in AUC, G-Mean, and MCC, respectively, across all the classifiers used in this study. Extensive testing of the machine learning algorithms and their performance metrics yielded promising results. A Graphical User Interface (GUI) was also incorporated to support interactive machine learning on class-imbalanced data, allowing flexibility in the choice of algorithm for a given dataset to obtain higher accuracy.
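
As an illustration of the kind of pipeline summarised above, the following Python sketch rebalances a toy dataset with a cluster-based undersampler and an interpolation-based oversampler, then reports AUC, G-Mean, and MCC for a scikit-learn classifier. It is a minimal sketch under stated assumptions, not the HCBST implementation itself: KMeans-based undersampling stands in for CUST, simple pairwise interpolation stands in for the SNOCC-derived oversampler, and the helper names cluster_undersample and interpolate_oversample are hypothetical.

    # Minimal illustrative sketch, NOT the authors' HCBST implementation:
    # a KMeans-based undersampler stands in for CUST and pairwise
    # interpolation stands in for the SNOCC-derived oversampler.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef, recall_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    def cluster_undersample(X_maj, n_keep, seed=0):
        # Keep the majority instance closest to each KMeans centroid (CUST stand-in).
        km = KMeans(n_clusters=n_keep, n_init=1, random_state=seed).fit(X_maj)
        idx = [int(np.argmin(np.linalg.norm(X_maj - c, axis=1))) for c in km.cluster_centers_]
        return X_maj[idx]

    def interpolate_oversample(X_min, n_new, seed=0):
        # Synthesise minority instances by interpolating random pairs (SNOCC-like stand-in).
        rng = np.random.default_rng(seed)
        a = X_min[rng.integers(0, len(X_min), n_new)]
        b = X_min[rng.integers(0, len(X_min), n_new)]
        return a + rng.random((n_new, 1)) * (b - a)

    # Imbalanced toy data (95% negative, 5% positive).
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

    # Hybrid resampling: under-sample the majority and over-sample the minority
    # towards a common class size, then train on the rebalanced set.
    X_maj, X_min = X_tr[y_tr == 0], X_tr[y_tr == 1]
    target = (len(X_maj) + len(X_min)) // 2
    X_maj_s = cluster_undersample(X_maj, target)
    X_min_s = np.vstack([X_min, interpolate_oversample(X_min, target - len(X_min))])
    X_bal = np.vstack([X_maj_s, X_min_s])
    y_bal = np.hstack([np.zeros(len(X_maj_s), dtype=int), np.ones(len(X_min_s), dtype=int)])

    clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
    proba = clf.predict_proba(X_te)[:, 1]
    pred = clf.predict(X_te)

    sens = recall_score(y_te, pred)               # sensitivity (positive-class recall)
    spec = recall_score(y_te, pred, pos_label=0)  # specificity (negative-class recall)
    print("AUC   :", roc_auc_score(y_te, proba))
    print("G-Mean:", np.sqrt(sens * spec))
    print("MCC   :", matthews_corrcoef(y_te, pred))

The same evaluation loop can be repeated over the other classifiers and sampling techniques listed above to reproduce the style of comparison reported in this study.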