Imbalanced data is a common problem in datasets used for machine learning classification. Class imbalance can be found in many different areas, including medical diagnosis, spam filtering, and fraud detection.
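The main problem with imbalanced data is that most machine learning algorithms work best with balanced data, i.e. data where each class of the target variable is roughly equally represented. On imbalanced data, a model can reach a high accuracy simply by always predicting the majority class while being useless in practice. This effect is known as the Accuracy Paradox.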
For example, in the credit card fraud detection dataset, only 1.7% of transactions are labeled fraudulent. A naive classifier that labels every transaction as non-fraudulent reaches about 98.3% accuracy, which may beat many machine learning models on that metric while needing almost no processing time. But it is useless in real life, because it never detects a single fraudulent transaction.
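The effect can be reproduced in a few lines. This is a minimal sketch on made-up data: the 1,000 transactions and the 17 fraud cases are an assumption chosen only to match the 1.7% figure, not the real dataset.

```python
# Hypothetical toy data: 1,000 transactions, 17 of them fraudulent (1.7%).
labels = [1] * 17 + [0] * 983          # 1 = fraud, 0 = legitimate

# A "classifier" that labels every transaction as non-fraudulent.
predictions = [0] * len(labels)

# Accuracy: share of transactions labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the fraud class: frauds caught / all frauds.
caught = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
recall = caught / sum(labels)

print(f"accuracy     = {accuracy:.3f}")  # 0.983
print(f"fraud recall = {recall:.3f}")    # 0.000
```

The accuracy looks excellent, yet the recall on the fraud class is zero, which is exactly the paradox described above.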
How to find if the data is imbalanced?
Plotting the distribution of each variable is a helpful way to see whether there is an imbalance, especially for numerical variables.
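For the target variable itself, the simplest check is to count how often each class occurs. The sketch below uses hypothetical labels and two illustrative helper names (`class_distribution`, `imbalance_ratio`) that are not taken from any library.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: share of samples} for a list of target labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical target column: 95 legitimate and 5 fraudulent records.
y = ["legit"] * 95 + ["fraud"] * 5

print(class_distribution(y))  # {'legit': 0.95, 'fraud': 0.05}
print(imbalance_ratio(y))     # 19.0
```

A ratio close to 1 means the classes are balanced. The larger the ratio, the more attention the tactics discussed later deserve.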
One way of checking the imbalance in the data is to calculate the skewness of the distribution. In the ideal case, we want a numerical variable to follow a normal distribution. Skewness measures the asymmetry of a distribution, i.e. how far the most frequent value is shifted to either side of the center.
The main property of a normal distribution is that the mean, median, and mode all lie at the same point. In real-world cases, however, we cannot expect every variable to be normally distributed. We check the skewness to see whether errors that occur during model evaluation might be due to skew in the initial observations.
The skewness of a distribution can be calculated using the Pearson coefficient of skewness.
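The text does not say which of Pearson's coefficients is meant. The sketch below assumes the second (median) coefficient, Sk = 3 * (mean - median) / standard deviation, and uses the sample standard deviation. Both choices are assumptions, and the sample values are made up.

```python
import statistics

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return 3 * (mean - median) / std

symmetric = [1, 2, 3, 4, 5]          # mean == median, so no skew
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value pulls the mean up

print(pearson_median_skewness(symmetric))     # 0.0
print(pearson_median_skewness(right_skewed))  # positive: right-skewed
```

A value near zero suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.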
The degree of tailedness of a distribution is measured by kurtosis. It tells us the extent to which the distribution is more or less outlier-prone (heavier- or lighter-tailed) than the normal distribution. Three types of curves are distinguished by kurtosis: mesokurtic (similar to the normal distribution), leptokurtic (heavier tails), and platykurtic (lighter tails).
The kurtosis of any univariate normal distribution is 3. It is common to compare the kurtosis of a distribution to this value. Distributions with kurtosis less than 3 are said to be platykurtic, although this does not imply the distribution is “flat-topped” as sometimes reported. Rather, it means the distribution produces fewer and less extreme outliers than does the normal distribution.
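The comparison against 3 can be tried out with a small sketch. It assumes the common moment-based definition, the fourth central moment divided by the squared variance, and uses randomly generated samples, so the printed values are only approximately 3 and 1.8.

```python
import random

def kurtosis(values):
    """Moment-based kurtosis: fourth central moment / variance squared."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(100_000)]
uniform_sample = [rng.uniform(-1, 1) for _ in range(100_000)]

k_normal = kurtosis(normal_sample)    # close to 3 (mesokurtic)
k_uniform = kurtosis(uniform_sample)  # close to 1.8 (platykurtic)
print(round(k_normal, 2), round(k_uniform, 2))
```

The uniform sample has no tails at all, so its kurtosis falls well below 3, in line with the description of platykurtic distributions above.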
Tactics To Combat Imbalanced Training Data
1. Collect More Data
If possible, collecting more data can be very helpful in dealing with imbalanced datasets. But the additional data should contain a higher concentration of the minority class, otherwise the imbalance remains.
It is very difficult to gather more data for datasets created over specific time periods, or when the probability of the target event happening is very low.
2. Change the Performance Metric
Accuracy is not the metric to use when working with imbalanced datasets, as it ends up being misleading. There are metrics designed to tell a more truthful story, and those should be used instead.
Some alternative evaluation metrics are:
- Precision (not to be confused with specificity, which is the true-negative rate): how many selected instances are relevant.
- Recall/Sensitivity: how many relevant instances are selected.
- F1 score: harmonic mean of precision and recall.
- MCC: correlation coefficient between the observed and predicted binary classifications.
- AUC: area under the ROC curve, which plots the true-positive rate against the false-positive rate across decision thresholds.
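Most of these metrics can be computed directly from the four cells of a binary confusion matrix. The sketch below does so for hypothetical counts. The function name `imbalance_metrics` and the example numbers are assumptions for illustration, and AUC is left out because it needs predicted scores rather than counts.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Compute common metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true-negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "mcc": mcc}

# Hypothetical model: 17 frauds in 1,000 transactions, 10 caught, 5 false alarms.
model = imbalance_metrics(tp=10, fp=5, fn=7, tn=978)
# Naive baseline that predicts "not fraud" for everything.
baseline = imbalance_metrics(tp=0, fp=0, fn=17, tn=983)

for name in ("accuracy", "precision", "recall", "f1", "mcc"):
    print(f"{name:10s} model={model[name]:.3f} baseline={baseline[name]:.3f}")
```

The two accuracies are nearly identical, while recall, F1, and MCC clearly separate the useful model from the naive baseline.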
3. Using Sampling Techniques
Resampling is a widely adopted technique for dealing with highly unbalanced datasets. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).
Despite the advantage of balancing classes, these techniques also have their weaknesses.
In under-sampling, we randomly remove records of the majority class to balance the classes. It is a very basic technique, but it can discard a lot of information. Hence this method is used when we have a very large dataset and a relatively low imbalance.
In over-sampling, we randomly duplicate records of the minority class to balance the classes. Its main disadvantage is overfitting to the minority class, since the model sees exact copies of the same records. Hence this method is used when we have a smaller dataset.
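Both variants of random resampling can be sketched with the standard library alone. The helper names and the toy data below are assumptions for illustration. In practice a dedicated library such as imbalanced-learn is the usual choice, and resampling should be applied to the training split only.

```python
import random
from collections import Counter

def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(g) for g in groups.values())
    pairs = [(row, label)
             for label, group in groups.items()
             for row in rng.sample(group, target)]
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]

def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(g) for g in groups.values())
    pairs = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        pairs.extend((row, label) for row in group + extra)
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]

# Hypothetical data: 95 majority rows (class 0) and 5 minority rows (class 1).
rows = list(range(100))
labels = [0] * 95 + [1] * 5

_, y_under = random_undersample(rows, labels)
_, y_over = random_oversample(rows, labels)
print(Counter(y_under))  # 5 rows per class
print(Counter(y_over))   # 95 rows per class
```

Under-sampling keeps only 10 of the 100 rows here, which illustrates the information loss, while over-sampling repeats each minority row many times, which illustrates the overfitting risk.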
4. Generating Synthetic Samples
Another take on dealing with imbalanced data is to create synthetically generated data points for the minority classes. Synthetic Minority Over-sampling Technique (SMOTE) is one such technique that generates new observations by interpolating between observations in the original dataset.
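The interpolation idea can be illustrated with a simplified sketch. It is not the reference SMOTE implementation: it assumes purely numeric features, Euclidean distance, and a brute-force neighbour search, and the function name `smote_sketch` and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force search for the k nearest other minority points.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class observations with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
synthetic = smote_sketch(minority, n_new=6, k=2)
for point in synthetic:
    print(point)
```

Each synthetic point lies on the line segment between two real minority observations, so the new data stays inside the region the minority class already occupies instead of repeating exact copies.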
5. Try Different Algorithms
Trying a range of different algorithms can also help in avoiding the Accuracy Paradox. The best way is to use algorithms that are less affected by imbalance in the target variable.
Tree-based methods in particular often hold up well on this type of dataset. Because a decision tree learns a hierarchy of splitting rules, it can be forced to address both classes.
If a basic decision tree is not enough, try other tree-based algorithms such as C4.5, C5.0, CART, and Random Forest.
6. Use Penalized Models
Penalized models impose an additional cost on classification mistakes made on the minority class during training, which pushes the model to pay more attention to that class. A popular algorithm for this technique is penalized SVM.
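One common way to set such penalties is to weight each class inversely to its frequency. The sketch below assumes the "balanced" heuristic, weight = n_samples / (n_classes * class count), and shows its effect on a weighted error rate. It does not train an SVM, and the helper names and data are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Share of total class-weighted cost that comes from mistakes."""
    total = sum(weights[y] for y in y_true)
    wrong = sum(weights[y] for y, p in zip(y_true, y_pred) if y != p)
    return wrong / total

# Hypothetical labels: 983 legitimate (0) and 17 fraudulent (1) transactions.
y_true = [0] * 983 + [1] * 17
all_negative = [0] * len(y_true)  # naive "never fraud" predictions

weights = balanced_class_weights(y_true)
print(weights)  # the fraud class gets a much larger weight
print(weighted_error(y_true, all_negative, weights))
```

With these weights, each missed fraud costs far more than a false alarm, so the naive "never fraud" predictor goes from a 1.7% plain error rate to a weighted error of about 50%. In scikit-learn, estimators such as `SVC` accept a `class_weight` parameter for the same purpose.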