How to Find Outliers in Python
Finding outliers is essential for a host of reasons, from keeping your averages from being skewed to ensuring that machine learning algorithms function properly. In this tutorial, I am going to show you several ways to find outliers in your data. We explore a few traditional methods such as Z-score and IQR. In addition, we will look at Tukey Ladders and ML algorithms that find the outliers for us.
Using Z-Score to Find Outliers in Python
In this method, we calculate the Z-score for each data point and flag the ones whose absolute Z-score is greater than a certain threshold. The benefit of this method is that it is very easy to use. The drawback is that it assumes your data is normally distributed. Here's how you can code it in Python:
#import the packages that we need to find outliers
import numpy as np
from scipy import stats
#create some fake data
data = np.array([1, 2, 3, 4, 5, 10, 20, 30])
#apply the z-score method and take the absolute values
z_scores = np.abs(stats.zscore(data))
#use a threshold of 1.96 because about 95% of normally distributed data falls within 1.96 standard deviations of the mean
threshold = 1.96
outliers = data[z_scores > threshold]
print(outliers)
It's important to note that the data is assumed to follow a normal distribution, the familiar bell-shaped curve. This assumption is what lets us say how much of the data should fall within a given number of standard deviations of the mean, and the standard deviation is itself a parameter of the normal distribution formula.
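To make the link between the Z-score and standard deviations concrete, here is a minimal sketch that computes the Z-score by hand as (x - mean) / standard deviation and compares it with `stats.zscore`. The names `manual_z` and `scipy_z` are only for illustration, and the sketch assumes the population standard deviation (`ddof=0`), which is the default in both NumPy and SciPy.

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 10, 20, 30])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual_z = (data - data.mean()) / data.std()

# stats.zscore applies the same formula with its default settings
scipy_z = stats.zscore(data)

# True when both calculations agree
print(np.allclose(manual_z, scipy_z))
# how many standard deviations each point sits from the mean
print(np.round(manual_z, 2))
```

Reading the rounded values side by side with the data shows which points sit furthest from the mean in standard deviation units.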
Using Standard Deviation to Find Outliers in Python
Using standard deviations to detect outliers is based on the idea that data points more than a certain number of standard deviations away from the mean are considered outliers. This is similar to the Z-score method described previously and has the same limitations.
The general idea is to calculate the mean and standard deviation of the data, and then identify the data points that are more than a chosen number of standard deviations away from the mean. The number of standard deviations used as the threshold can be adjusted depending on the desired level of sensitivity and the distribution of the data:
#import the packages
import numpy as np
#create the data and take the mean and standard deviation
data = np.array([1, 2, 3, 4, 5, 10, 20, 30])
mean = np.mean(data)
std_dev = np.std(data)
#treat points more than 3 standard deviations from the mean as outliers
threshold = 3
#create the condition to find outliers
outliers = data[np.abs(data - mean) > threshold * std_dev]
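#no point in this small sample is more than 3 standard deviations from the mean, so this prints an empty array
print(outliers)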
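To see how the choice of threshold changes the result, here is a small sketch that wraps the same calculation in a helper and tries two thresholds. The function name `std_outliers` is hypothetical and only used for this illustration.

```python
import numpy as np

def std_outliers(values, threshold):
    """Return the values more than `threshold` standard deviations from the mean."""
    values = np.asarray(values)
    distance = np.abs(values - values.mean())
    return values[distance > threshold * values.std()]

data = np.array([1, 2, 3, 4, 5, 10, 20, 30])

# compare a stricter and a looser threshold on the same data
for threshold in (2, 3):
    print(threshold, std_outliers(data, threshold))
```

On this sample, a threshold of 2 flags 30 while a threshold of 3 flags nothing. Large values inflate the mean and standard deviation themselves, which is one reason to adjust the threshold to your data.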
Using IQR or Boxplot Method to Find Outliers
In this method, we divide the data into quartiles (the 25th, 50th and 75th percentiles). We calculate the interquartile range (IQR = Q3 - Q1) and identify the data points that lie outside the limits below. Here is how to calculate the upper and lower data limits:
Lower range limit = Q1 - (1.5 * IQR). Essentially, this is 1.5 times the interquartile range subtracted from your 1st quartile.
Higher range limit = Q3 + (1.5 * IQR). This is 1.5 times the IQR added to your 3rd quartile.
Here is how you would implement this in Python with some simple-to-use code:
# import the packages
import numpy as np
# create the data
data = np.array([1, 2, 3, 4, 5, 10, 20, 30])
# assign your quartiles, IQR and limits
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5*iqr
upper_bound = q3 + 1.5*iqr
#create conditions to isolate the outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)
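The same 1.5 * IQR rule is what a boxplot uses to decide which points to draw as individual "fliers". As a hedged sketch, assuming matplotlib is installed, `matplotlib.cbook.boxplot_stats` returns the numbers behind a boxplot without drawing anything, so you can compare them with the manual calculation above. The variable name `box` is just for illustration.

```python
import numpy as np
from matplotlib import cbook

data = np.array([1, 2, 3, 4, 5, 10, 20, 30])

# boxplot_stats returns one dict of statistics per dataset; whis=1.5 is the usual boxplot rule
box = cbook.boxplot_stats(data, whis=1.5)[0]

print("Q1:", box["q1"], "Q3:", box["q3"], "IQR:", box["iqr"])
# fliers are the points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
print("fliers:", box["fliers"])
```

The fliers reported here should match the outliers found by the NumPy code, since both apply the same limits to the same quartiles.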
Tukey Ladder Method
This method is similar to the IQR method, but it uses a different threshold to identify outliers. Instead of 1.5 * IQR, we use a threshold of k * IQR, where k is a parameter that can be set by the user. We also center the limits on the median instead of using the mean. Here's how you can implement it in Python:
#import the package you need
import numpy as np
data = np.array([1, 2, 3, 4, 5, 10, 20, 30])
# set k to 1.5, 2 or whatever value your analysis needs
k = 2.0
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = np.median(data) - k*iqr
upper_bound = np.median(data) + k*iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)
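For comparison, the conventional definition of Tukey's fences places the limits k * IQR beyond the quartiles (Q1 - k * IQR and Q3 + k * IQR) rather than around the median. The sketch below runs both versions on the same data so you can see how the choice of center changes what gets flagged. The helper `outside` and the names `median_limits` and `quartile_limits` are hypothetical, introduced only for this illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 10, 20, 30])
k = 2.0

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
median = np.median(data)

# limits centered on the median, as in the code above
median_limits = (median - k * iqr, median + k * iqr)

# limits measured from the quartiles (conventional Tukey fences)
quartile_limits = (q1 - k * iqr, q3 + k * iqr)

def outside(values, limits):
    """Return the values that fall below the lower or above the upper limit."""
    low, high = limits
    return values[(values < low) | (values > high)]

print("median-centered:", median_limits, outside(data, median_limits))
print("quartile-based:", quartile_limits, outside(data, quartile_limits))
```

With k = 2, the median-centered limits are narrower on this sample, so they flag 30 while the quartile-based fences flag nothing.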
Isolation Forest Algorithm
Isolation Forest is an unsupervised algorithm for detecting outliers. It builds many isolation trees, which are binary trees that repeatedly split the data into smaller subsets. Each isolation tree is created by randomly picking a subset of the data and recursively splitting it using a randomly chosen feature and split value.
Isolation Forest assumes that outliers are few and different from the rest of the data, so it takes fewer random splits to isolate them than to isolate inliers, which sit in denser regions. Hence, the method identifies outliers as the points that end up with short path lengths in the trees.
After building the isolation trees, the algorithm calculates an anomaly score for each data point. The score is based on the average path length of the data point across all isolation trees: shorter average path lengths indicate anomalies and outliers.
#bring in NumPy and the Isolation Forest class
import numpy as np
from sklearn.ensemble import IsolationForest
#create your data
data = np.array([1, 2, 3, 4, 5, 10, 20, 30]).reshape(-1, 1)
#instantiate the algorithm, set the estimators and contamination, and fit the data
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1)
isolation_forest.fit(data)
# predict returns -1 for outliers and 1 for inliers, so keep the rows labeled -1
outliers = data[isolation_forest.predict(data) == -1]
print(outliers)
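If you want to look at the anomaly scores themselves rather than just the -1 / 1 labels, scikit-learn exposes them through `score_samples`. The sketch below is one way to inspect them. The `random_state=0` argument is an assumption added here so that repeated runs give the same trees, and the name `model` is only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

data = np.array([1, 2, 3, 4, 5, 10, 20, 30]).reshape(-1, 1)

# random_state is an added assumption so the run is repeatable
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
model.fit(data)

# score_samples returns one score per point; lower scores correspond to shorter average paths
scores = model.score_samples(data)

# list the points from most anomalous (lowest score) to least anomalous
for value, score in sorted(zip(data.ravel(), scores), key=lambda pair: pair[1]):
    print(value, round(float(score), 3))
```

Sorting by score shows how strongly each point stands out, which can help when choosing a value for `contamination`.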
Note that these are just a few of the many methods that can be used to identify outliers. The choice of method depends on the specific requirements of your analysis and the nature of your data.