What are outliers?
An observation that greatly deviates from normal behavior, i.e., an event that is highly inconsistent with the rest of the data set, is called an outlier. It is left up to the analyst to decide what will be considered abnormal, and normal behavior must be characterized before abnormal behavior can be singled out.
Why do outliers occur?
According to Tom Bodenberg, Data Scientist and Economist at Unity Marketing, “An outlier can be the result of measurement or recording errors, or the unintended outcome resulting from the set’s definition.”
It’s important to ask why outliers occur at all. There can be a number of reasons: the value may come from a measurement taken while the equipment wasn’t working properly, or from human or transcription error. When you know the reason for an error, it’s easy to recheck and correct the record. But what if there is no reason to suspect an error?
Why is Outlier Detection important?
Data analytics deals with making observations across various data sets and trying to make sense of the data. One of the most important tasks when dealing with very large data sets is finding outliers, because outliers can skew or bias any analysis performed on the data. It is therefore very important to detect and adequately deal with them.
The importance of outlier detection stems from the fact that for a variety of application domains anomalies in data often translate to significant and often critical actionable insights. For instance, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending out sensitive data to an unauthorized destination. Anomalies in credit card transaction data could indicate credit card or identity theft.
How to detect outliers?
1. Time Series Analysis
Time series analysis is the analysis of a sequence of data points taken at successive, equally spaced points in time. Meaningful statistics are extracted from this discrete-time data and used to predict future values based on previously observed values.
Let’s say, for instance, you’re measuring the number of users on your website every minute: five users in minute one, nine in minute two, three in minute three, and so on, up to some observed maximum, say 100 for this example. If you suddenly see 900 users, you know that something has gone wrong. This is one way to detect outliers: Time Series Analysis, where you figure out what your time series is, its maximum, and the deviation on that series.
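As a minimal sketch of this idea (plain Python, not DNIF syntax; the per-minute counts are made up), you can compute a fixed baseline from past readings and flag any new reading that falls too far outside it:

def build_baseline(history):
    # Compute a fixed baseline: the mean and standard deviation of past readings.
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    return mean, variance ** 0.5

def is_outlier(value, mean, std, k=3.0):
    # Flag any reading more than k standard deviations away from the mean.
    return abs(value - mean) > k * std

history = [5, 9, 3, 47, 88, 100, 62, 71]  # per-minute user counts, max around 100
mean, std = build_baseline(history)
print(is_outlier(900, mean, std))  # True: 900 users is far outside the baseline

Note that the baseline here is computed once and never updated, which leads directly to the disadvantages below.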
Disadvantages of this method
Static baselining is based on the average behaviour of the user’s login patterns. This method calculates a one-time, fixed baseline value from readings monitored over a time interval you specify. As a result, it can handle only limited data and fails to keep up as data consumption grows. Moreover, changing the threshold manually is a tedious task.
2. Profilers
A Profiler is a method by which you map out all the different trees and branches of a particular case, and it is unique for each individual. Let’s say, for instance, Albert has access to only two databases: a Master database and a User database. In the Master database, he only performs SELECT and UPDATE operations; in the User database, he performs only SELECT operations. So when he accesses a database outside his profile, such as a Product Catalog, that is an outlier. And when he runs DELETE within the Master database, that becomes a second outlier. Anything that doesn’t match the established profile is identified as an outlier.
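As a minimal sketch of the profiler idea (plain Python, not DNIF syntax; the user, databases, and operations are the hypothetical ones from the example above):

# Albert's profile: the databases and operations he normally uses.
profile = {
    "albert": {
        "Master": {"SELECT", "UPDATE"},
        "User": {"SELECT"},
    },
}

def check(user, database, operation):
    # Compare an observed action against the user's established profile.
    allowed = profile.get(user, {})
    if database not in allowed:
        return "outlier: access to a new database"
    if operation not in allowed[database]:
        return "outlier: new operation on a known database"
    return "normal"

print(check("albert", "Master", "UPDATE"))          # normal
print(check("albert", "ProductCatalog", "SELECT"))  # first outlier
print(check("albert", "Master", "DELETE"))          # second outlier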
How to SNIFF out outliers with DNIF?
DNIF creates Profilers, which are unique baselines for individuals. DNIF goes beyond static threshold-based baselining and provides a Login Activity Dashboard with Active VPN Connections, Login Successful by User, Successful User Authentications and Login Failures by Users.
DNIF identifies and notifies you when the count of failed logins increases beyond a specified number of attempts. The following query creates a Profiler with a 20% threshold on the upper limit of the average logins per individual user:
_fetch * from event where $Duration=24h AND $LogName=NIX AND $Action=LOGIN_FAIL NOT $User=root AND $SubSystem=AUTHENTICATION group count_unique $SystemName limit 100
>>_field $ULimit expr count_unique + count_unique * 20/100
>>_store in_disk User_Profiler stack_replace
The following query creates another Profiler with a 20% threshold on the lower limit of the average logins per individual user:
_fetch * from event where $Duration=24h AND $LogName=NIX AND $Action=LOGIN_FAIL NOT $User=root AND $SubSystem=AUTHENTICATION group count_unique $SystemName limit 100
>>_field $Llimit expr count_unique - count_unique * 20/100
>>_store in_disk User_Llimit_Profiler stack_replace
Cron Schedule: 0 0 * * *
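The arithmetic these two queries store is simple. As a hedged restatement (plain Python, not DNIF syntax; the baseline value is made up):

def thresholds(baseline_count, pct=20):
    # Upper and lower limits at pct% above and below the baseline count,
    # mirroring the $ULimit and $Llimit expressions in the queries above.
    delta = baseline_count * pct / 100
    return baseline_count - delta, baseline_count + delta

baseline = 50                      # hypothetical average failed-login count per user
low, high = thresholds(baseline)   # (40.0, 60.0)
today = 75
if not (low <= today <= high):
    print("outlier: failed-login count outside the 20% band")

The Cron Schedule line means the profiling queries are re-run every day at midnight, so the stored baselines are refreshed daily rather than fixed once.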
Advantages of this method
The primary advantage of using Profilers for outlier detection is that they differentiate data per unique device IP. This method also lets you reset thresholds on a daily basis using a cron scheduler, which runs queries periodically at fixed times, dates, or intervals and is well suited to scheduling repetitive tasks.
Its syntax is as follows: Cron Schedule: * * * * *
The first asterisk represents the minute (0–59), the second the hour (0–23), the third the day of the month (1–31), the fourth the month (1–12) and the fifth the day of the week (0–6), where Sunday is 0. An asterisk matches all values that can appear in the given position. For instance, if you want to run a query every hour, you can schedule it as follows:
Cron Schedule: 0 * * * *
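To make the field positions concrete, here is a trivial sketch (plain Python) that names the five fields of that schedule string:

# Map each position of a cron schedule to its field name.
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    "0 * * * *".split(),
))
print(fields)
# {'minute': '0', 'hour': '*', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
# minute=0 with every other field wildcarded: run at the top of every hour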
Conclusion
Outliers can noticeably skew results, which is why it is so important to discover and detect them in large data sets. With DNIF’s unique outlier detection method, you can detect outliers in complex and very large data sets quickly and more accurately.