Sample subset optimization (SSO) [1] can be used as an intelligent method for sampling from imbalanced data [2].
It works by ranking and selecting the most informative samples from the majority class, which are then combined
with the samples from the minority class to form a balanced dataset.
GUI version
The GUI version, SSOgui.jar, runs with a double-click, assuming you have Java installed on your computer.
The dataset exampleData.txt can be used to test the program.
When using your own dataset, please define each column and row in the same way as in exampleData.txt.
In the example dataset, instances 31 to 35 are noise introduced into the majority class: although labeled as
majority-class samples, they were generated from the minority class distribution. The goal is therefore to give
them a fairly low rank when sampling from the majority class. SSOgui.jar is straightforward to use and
self-explanatory. Let me know if you need further detail on how to use it.
Download exampleData.txt. Note that the example data is a
tab-delimited file, while the command line version expects the data in ARFF format, so you will need to run
the TAB2ARFF.pl converter first. This requires Perl to be installed on your computer.
Data sampling [3] is a useful procedure when the data to be analyzed has an imbalanced class distribution,
that is, samples from one class (referred to as the majority class) outnumber samples from another class
(referred to as the minority class). This matters because classification algorithms (classifiers) may be biased
towards the majority class. Random over-sampling (randomly duplicating minority class samples until they match
the majority class in number) and random under-sampling (randomly discarding majority class samples until they
match the minority class in number) are the simplest ways to balance the class distribution. The disadvantage of
random over-sampling is that it duplicates samples in the minority class, which may be completely ignored by some
classifiers. Random under-sampling, in turn, throws information away. Currently, the most popular intelligent
(as opposed to random) over-sampling strategy is SMOTE [4], which produces "new" synthetic samples instead of
directly duplicating samples from the minority class. However, there are few good proposals for intelligent under-sampling.
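As a point of reference, the two random strategies are trivial to implement. A minimal sketch (the helper names are hypothetical; samples are represented as plain Python lists) is:

```python
import random

def random_under_sample(majority, minority, seed=0):
    # Randomly discard majority samples until both classes are equal in size.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def random_over_sample(majority, minority, seed=0):
    # Randomly duplicate minority samples until both classes are equal in size.
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra
```

Both run in a single pass, but neither looks at the samples themselves, which is exactly the weakness discussed above.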
Here, we propose an intelligent under-sampling approach based on the sample subset optimization technique.
The usefulness of each sample from the majority class is evaluated, and the most useful samples are selected
and combined with the samples from the minority class. This results in a balanced under-sampled dataset.
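The actual SSO algorithm in [1] treats this selection as an optimization problem. Purely to illustrate the ranking idea, here is a much simplified heuristic sketch (the distance-based score is an illustrative stand-in, not the fitness used in the paper) that keeps the majority samples farthest from the minority class:

```python
def sso_style_under_sample(majority, minority):
    # Illustrative heuristic (NOT the actual SSO fitness from [1]):
    # score each majority sample by its distance to the nearest minority
    # sample, so majority samples lying inside the minority region
    # (e.g. the noise instances in exampleData.txt) rank low and are
    # dropped first. Samples are numeric feature tuples.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def score(m):
        return min(dist(m, s) for s in minority)

    ranked = sorted(majority, key=score, reverse=True)
    # Keep as many majority samples as there are minority samples.
    return ranked[: len(minority)] + minority
```

On a toy dataset, a majority sample sitting inside the minority cluster is ranked last and excluded, which mirrors the intended behavior for instances 31 to 35 of the example data.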
The following two figures display the decision boundary of a kNN classifier built on an imbalanced dataset,
without and with SSO-based sampling.
References
[1] Yang, P.#, Yoo, P., Fernando, J., Zhou, B., Zhang, Z. & Zomaya, A. (2014).
Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications.
IEEE Transactions on Cybernetics, 44(3), 445-455.
[2] Yang, P.#, Xu, L., Zhou, B., Zhang, Z. & Zomaya, A. (2009).
A particle swarm based hybrid system for imbalanced medical data sampling.
BMC Genomics, 10, S34.
[3] He, H. & Garcia, E. (2009). Learning from Imbalanced Data.
IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
[4] Chawla, N., Bowyer, K., Hall, L. & Kegelmeyer, W. (2002).
SMOTE: Synthetic Minority Over-sampling Technique.
Journal of Artificial Intelligence Research, 16, 341-378.