Dealing with Data Outliers

Dhruv Khanna
2 min read · Oct 23, 2021

What is a Data Outlier?

An outlier is a data point that differs markedly from the other observations in a dataset.

An outlier may be due to variability in the measurement, or it may indicate an experimental error. If possible, outliers should be excluded from the dataset. However, detecting them can be difficult and is not always possible.

Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times, less accurate models, and ultimately poorer results.

How to Detect Outliers?

1. Univariate method

One of the simplest methods for detecting outliers is the box plot. Also known as a box-and-whisker plot, it graphically displays the distribution of the data using the median and the lower and upper quartiles. This method can only detect outliers in one feature variable at a time.
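To make the box plot idea concrete, here is a minimal sketch of the rule most box plots use to draw their whiskers. The 1.5 × IQR cutoff is a common convention rather than something this article prescribes, and the function name `iqr_outliers` and the sample values are made up for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the box plot whiskers.

    k=1.5 is the conventional whisker length (an assumption here).
    """
    # Lower quartile, median, and upper quartile of the feature.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical single feature with one suspicious reading.
sample = [10, 12, 11, 13, 12, 11, 14, 95]
print(iqr_outliers(sample))  # [95]
```

Because the rule looks at one column of numbers, it has to be run separately for every feature.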

2. Multivariate method

An outlier does not need to be an extreme value in any single variable. Therefore, as we have seen with Point B, the univariate method does not always work well. The multivariate method tries to solve that by building a model using all the data available and then removing the instances whose errors exceed a given value.
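A rough sketch of that idea, under some loud assumptions: the model is a plain least-squares line (the article does not say which model to use), the error threshold of 3.0 is picked by hand, and the function name `flag_by_residual` and the data are hypothetical.

```python
def flag_by_residual(xs, ys, threshold):
    """Fit y = slope * x + intercept on ALL points, then return the
    indices of the points whose absolute error exceeds the threshold."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares estimates for a single input variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return [i for i, e in enumerate(errors) if e > threshold]

# Hypothetical data following y = 2x, except the fourth point (4, 14).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 14, 10, 12, 14, 16]
print(flag_by_residual(xs, ys, threshold=3.0))  # [3]
```

Note that the flagged point has y = 14, which sits comfortably inside the range of the other y values. It only stands out once x and y are considered together, which is exactly the case a one-feature check misses.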

3. Minkowski Error

The Minkowski error is a loss index that is less sensitive to outliers than the standard sum squared error. Rather than detecting outliers, it reduces their impact during training. The sum squared error squares each instance error, so outliers contribute disproportionately to the total error. The Minkowski error addresses this by raising each instance error to a power smaller than 2, for instance 1.5. This reduces the contribution of outliers to the total error. For example, if an outlier has an error of 10, its squared error is 100, while its Minkowski error with exponent 1.5 is 10^1.5 ≈ 31.62.
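The arithmetic above is easy to check in a few lines. This is only a sketch of the two loss formulas, assuming the loss is the sum of absolute instance errors raised to the exponent (1.5 in the example). The function names and the error values are illustrative.

```python
def sum_squared_error(errors):
    """Standard loss: every instance error is squared."""
    return sum(e ** 2 for e in errors)

def minkowski_error(errors, p=1.5):
    """Same idea, but each absolute error is raised to p < 2."""
    return sum(abs(e) ** p for e in errors)

# Two ordinary instances and one outlier with an error of 10.
errors = [1.0, -1.0, 10.0]
print(sum_squared_error(errors))          # 102.0
print(round(minkowski_error(errors), 2))  # 33.62
```

With the squared error, the outlier accounts for 100 of the total 102. With the exponent of 1.5 it accounts for about 31.62 of 33.62, so it still dominates, just far less aggressively.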
