Sample subset optimization (SSO) [1] can be used as an intelligent method for sampling from imbalanced data [2].
It works by ranking and selecting the most informative samples from the majority class, which are then combined
with the samples from the minority class to form a balanced dataset.
GUI version
The GUI version, SSOgui.jar, runs with a double-click, assuming you have Java installed on your computer.
The dataset exampleData.txt can be used to test the program.
When using your own dataset, please define each column and row in the same way as in exampleData.txt.
In the example dataset, instances 31 to 35 are noise introduced into the majority class: although labeled as
majority-class samples, they were generated from the minority class distribution. The goal is therefore to give
them a fairly low rank when sampling from the majority class. SSOgui.jar is straightforward to use and
self-explanatory. Let me know if you need further detail on how to use it.
Download exampleData.txt. Note that the example data is a
tab-delimited file, while the command line version expects the data in ARFF format, so you will need to run
the TAB2ARFF.pl converter first. This requires Perl to be installed on your computer.
Data sampling [3] is a useful procedure when the data to be analyzed has an imbalanced class distribution,
that is, samples from one class (referred to as the majority class) outnumber samples from another class
(referred to as the minority class). This matters because classification algorithms (classifiers) may be biased
towards the majority class. Random over-sampling (randomly duplicating minority class samples until they match
the majority class in number) and random under-sampling (randomly discarding majority class samples until they
match the minority class in number) are the simplest ways to balance the class distribution. The disadvantage of
random over-sampling is that it duplicates samples in the minority class, which may be completely ignored by some
classifiers. Random under-sampling, in turn, throws information away. Currently, the most popular intelligent
(as opposed to random) over-sampling strategy is SMOTE [4], which produces "new" synthetic samples instead of
directly duplicating samples from the minority class. However, there are few good proposals for intelligent under-sampling.
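As a point of reference, the two random strategies are trivial to implement. A minimal sketch (the helper names are hypothetical; samples are represented as plain Python lists) is:

```python
import random

def random_under_sample(majority, minority, seed=0):
    # Randomly discard majority samples until both classes are equal in size.
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def random_over_sample(majority, minority, seed=0):
    # Randomly duplicate minority samples until both classes are equal in size.
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra
```

Both run in a single pass, but neither looks at the samples themselves, which is exactly the weakness discussed above.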
Here, we propose an intelligent under-sampling approach based on the sample subset optimization technique.
The usefulness of each sample from the majority class is evaluated, and the most useful samples are selected
and combined with the samples from the minority class. This results in a balanced under-sampled dataset.
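The actual SSO algorithm in [1] treats this selection as an optimization problem. Purely to illustrate the ranking idea, here is a much simplified heuristic sketch (the distance-based score is an illustrative stand-in, not the fitness used in the paper) that keeps the majority samples farthest from the minority class:

```python
def sso_style_under_sample(majority, minority):
    # Illustrative heuristic (NOT the actual SSO fitness from [1]):
    # score each majority sample by its distance to the nearest minority
    # sample, so majority samples lying inside the minority region
    # (e.g. the noise instances in exampleData.txt) rank low and are
    # dropped first. Samples are numeric feature tuples.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def score(m):
        return min(dist(m, s) for s in minority)

    ranked = sorted(majority, key=score, reverse=True)
    # Keep as many majority samples as there are minority samples.
    return ranked[: len(minority)] + minority
```

On a toy dataset, a majority sample sitting inside the minority cluster is ranked last and excluded, which mirrors the intended behavior for instances 31 to 35 of the example data.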
The following two figures display the decision boundary of a kNN classifier built on an imbalanced dataset,
without and with SSO-based sampling.
References
[1] Yang, P.#, Yoo, P., Fernando, J., Zhou, B., Zhang, Z. & Zomaya, A. (2014).
Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications.
IEEE Transactions on Cybernetics, 44(3), 445-455.
[2] Yang, P.#, Xu, L., Zhou, B., Zhang, Z. & Zomaya, A. (2009).
A particle swarm based hybrid system for imbalanced medical data sampling.
BMC Genomics, 10, S34.
[3] He, H. & Garcia, E. (2009). Learning from Imbalanced Data.
IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
[4] Chawla, N., Bowyer, K., Hall, L. & Kegelmeyer, W. (2002).
SMOTE: Synthetic Minority Over-sampling Technique.
Journal of Artificial Intelligence Research, 16, 341-378.