How to Find Outliers in SQL
Finding outliers in your data is easier than you think. You can use statistics concepts to identify the threshold where outliers may fall. One of the most accurate ways is to use the Z-Score. This score is composed of the mean, standard deviation, and current value. You see from the image below how this is composed. This X would be the value that is being compared.
So the calculation of this score will give you a value that exists from the normal distribution. This distribution looks like a bell curve. This score will dictate the probability of where that particular data falls on this distribution scale. Let’s take a look at this on the scale. However first before we link this score to a point on the normal distribution. Let’s see how the standard deviation and to mean are associated with this distribution.
We can see that the mean is the middle and the standard deviation from that mean will give us extreme values. So 3 standard deviations from the mean. We can associate a Z-score with each one of this standard deviations from the mean. We call this translation of the normal distribution to Z-score a standard normal distribution
We can see that 3 standard deviation from the mean which is going to a clear outlier range translates to a plus 3 or a -3. So let’s dive into how to do this in SQL which is quite easy.
SQL Formula to Find Z-Scores
SELECT, Date, Users, (Users-AVG(Users) Over())/STDEV(Users) Over()) as Zscore from Web_Data
Let’s examine this SQL code. We essentially repeated the formal from above. to get a Z-score column in our data. Now that may not be at the end of your story. You will need to decide where your Outlier threshold should be at data that is only at the top 5% of extremes would have a Z-score of 1.96 or -1.96. You can check out the Z-Score table at the z-score.net
Use a Where Clause to set your Filter and Subquery
SELECT * FROM (SELECT, Date, Users, (Users-AVG(Users) Over())/STDEV(Users) Over() as Zscore FROM Web_Data) as z_table WHERE Zscore > 1.96 or Zscore < -1.96
This will provide data that is outside 95% of our data. So this would be the bottom 5% or the top 5%. You can adjust your Z-score to adjust your threshold level.