Abnormal Detection — Fraud Detection using Multivariate Gaussian Technique
by F. N. Logothetis
In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set, under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as “normal” and “abnormal” and involve training a classifier. Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learnt model.
In the majority of cases, the dataset is unbalanced, with many normal cases and only a few abnormal ones; hence, well-known classifiers such as SVMs, decision trees, or even neural nets tend to perform poorly unless the imbalance is handled explicitly. In this paper, we adopt an unsupervised technique, called Multivariate Gaussian (MG), in order to tackle the issue of unbalanced or unlabeled data. Let’s review the basics of MG.
In probability theory and statistics, the MG is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value. Its importance derives mainly from the multivariate central limit theorem. In its univariate form, the central limit theorem states that if you take sufficiently large random samples, with replacement, from a population with mean μ and finite standard deviation σ, then the distribution of the sample means will be approximately normal.
Let’s say that we want to detect anomalies in an aircraft engine and the mechanical engineers give us only some features, called x_i, where x_1 = heat generated, x_2 = vibration intensity, ..., x_N = fuel consumption. For a more convenient representation of the input, we collect all of these features in a vector x = [x_1, x_2, ..., x_N]. The MG density is formulated as
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$
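where the parameters μ and Σ are calculated as
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{T}$$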
where m is the size of the training set. μ is a vector whose i-th entry is the mean value of the input feature x_i. The second parameter, Σ, is the covariance matrix, which describes how strongly the input features vary together. |Σ| is the determinant of the matrix Σ, and x^(i) is the i-th training sample. Hence, μ and Σ are estimated from the training set.
p(x; μ, Σ) is the probability density at a specific input x, that is, how plausible it is that x was drawn from the MG distribution with parameters (μ, Σ). If this value is below a threshold ε, then the input is considered abnormal (not part of our MG). On the other hand, if this value is greater than or equal to ε (p(x; μ, Σ) ≥ ε), then the input x is considered normal.
Now let’s dive into another problem, which is called ‘Credit Card Fraud Detection’. It is important for credit card companies to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. A publicly available dataset for this problem can be found on Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud. Download the .csv file and open it. This dataset comprises a series of transactions with 28 features (V1 to V28). For confidentiality reasons, we don’t know the actual name of each feature, but the dataset is labeled: if Class = 1, then the transaction is fraudulent, i.e. abnormal. Our goal is to correctly classify the fraud and non-fraud transactions.
Step 1: Read .csv file
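A minimal sketch of this step, assuming the file uses the column names V1 to V28 and Class described above. The function name load_transactions is hypothetical, and the standard-library csv module is used here only to keep the example small; it is not necessarily how the original analysis loaded the data.

```python
import csv
import numpy as np

# Feature columns V1 .. V28, as named in the dataset description.
FEATURES = [f"V{i}" for i in range(1, 29)]

def load_transactions(path):
    """Return (X, y): X has one row per transaction, y holds the Class labels."""
    rows, labels = [], []
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            rows.append([float(record[name]) for name in FEATURES])
            # Class is 1 for fraud and 0 otherwise.
            labels.append(int(float(record["Class"])))
    return np.array(rows), np.array(labels)
```

It could be called as X, y = load_transactions("creditcard.csv"), where the file name is an assumption about how the download was saved.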
Step 2: Since the features cannot be plotted directly because of their high dimensionality, we extract the two most important principal components.
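One way to obtain the two leading principal components is a singular value decomposition of the centered data, sketched below with NumPy. The helper name top_two_components and the synthetic input are illustrative assumptions; a library implementation such as scikit-learn's PCA would serve equally well.

```python
import numpy as np

def top_two_components(X):
    """Project the rows of X onto their two leading principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Synthetic stand-in for the 28 transaction features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 28))
Z = top_two_components(X_demo)
print(Z.shape)  # (500, 2)
```

The two columns of Z can then be drawn as a scatter plot, with each point colored by its Class label.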
The figure above depicts the two most important principal components. We observe that some fraud and non-fraud cases overlap; hence, it will be difficult to classify all possible inputs correctly.
When we make an assumption about the underlying distribution of the data, we have to be sure that this assumption is reasonably close to reality. For this problem, we assume that the data are drawn from a Gaussian (normal) distribution. The following figures show that the input features are approximately Gaussian. If the data are non-Gaussian, a useful strategy is to transform them using functions such as log(x), log(x + 1), or sqrt(x) (that is, x^0.5).
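The effect of such transformations can be checked numerically. The sketch below uses synthetic right-skewed data (an assumption, not the credit card features) and a hand-written sample skewness, which is close to 0 for symmetric, Gaussian-like data.

```python
import numpy as np

def skewness(values):
    """Sample skewness: roughly 0 for symmetric, Gaussian-like data."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, strictly positive, right-skewed feature (log-normal draws).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

print(skewness(raw))            # strongly positive
print(skewness(np.log(raw)))    # log(x): close to 0 for this particular data
print(skewness(np.log1p(raw)))  # log(x + 1): smaller than for the raw data
print(skewness(np.sqrt(raw)))   # sqrt(x): smaller than for the raw data
```

Note that log(x) requires strictly positive values and sqrt(x) requires non-negative ones, so the choice of transformation depends on the range of each feature.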
Step 3: Calculate the multivariate Gaussian parameters and pdf.
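A minimal NumPy sketch of this step, assuming the standard maximum-likelihood estimates (mean vector and covariance with 1/m normalisation). The function names are hypothetical, and the density is returned on a log scale, which is a choice made here to avoid numerical underflow with 28 features.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate mu (mean vector) and sigma (covariance, 1/m normalisation)."""
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma

def log_density(X, mu, sigma):
    """Return log p(x; mu, sigma) for each row x of X."""
    n = mu.shape[0]
    centered = X - mu
    # Solve sigma * z = centered^T rather than forming the explicit inverse.
    solved = np.linalg.solve(sigma, centered.T).T
    mahalanobis = np.sum(centered * solved, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + mahalanobis)

# Synthetic 2-D training data with correlated features (not the real dataset).
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
mu, sigma = fit_gaussian(X_train)

typical = np.array([[0.5, 0.4]])   # follows the correlation pattern
unusual = np.array([[3.0, -3.0]])  # violates it
print(log_density(typical, mu, sigma), log_density(unusual, mu, sigma))
```

Because the logarithm is monotonic, comparing log p(x) against log ε gives the same decisions as comparing p(x) against ε.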
The ε hyperparameter is critical to the performance of the algorithm. Technically speaking, ε is the threshold that splits the input space into two regions. An input whose density falls below that threshold is considered abnormal and is classified as fraud. The most promising way of finding the best ε relies on trial and error: a range of candidate ε values is examined on a labeled validation set, and the best-performing value is selected. This technique is called cross-validation. As a final step, the MG is evaluated on the test data.
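The threshold search can be sketched as follows. The text does not name the metric used to compare candidates, so the F1 score is an assumption made here; it is a common choice for unbalanced data, where plain accuracy is misleading. The function names and the synthetic validation scores are likewise illustrative.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for the positive (fraud = 1) class; 0.0 if no fraud is caught."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(scores_val, y_val, n_candidates=1000):
    """Scan thresholds between the min and max validation score; keep the best F1."""
    best_eps, best_f1 = None, -1.0
    for eps in np.linspace(scores_val.min(), scores_val.max(), n_candidates):
        y_pred = (scores_val < eps).astype(int)  # below the threshold -> fraud (1)
        score = f1_score(y_val, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Synthetic validation scores (e.g. log-densities): 990 normal, 10 fraudulent.
rng = np.random.default_rng(0)
scores_val = np.concatenate([rng.normal(-5.0, 1.0, 990), rng.normal(-30.0, 3.0, 10)])
y_val = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
eps, f1 = select_epsilon(scores_val, y_val)
print(eps, f1)
```

The scores may be densities or log-densities, as long as ε is interpreted on the same scale. The selected ε is then applied unchanged to the held-out test data for the final evaluation.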