Outlier Detection with Multivariate Normal Distribution

Data/Machine learning 2021. 1. 8. 13:28

Suppose we have a dataset with two random variables that follow a normal distribution:

Dataset where the two random variables follow the normal distribution

Then, we can build a 2-dimensional normal distribution that fits the above dataset:
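As a minimal sketch of this fitting step, the snippet below estimates the mean vector and covariance matrix with NumPy and evaluates the resulting density. The dataset `X` is synthetic (the original data is not available here), and the names `true_mean`, `true_cov`, `mu`, `sigma`, and `density` are illustrative assumptions, not taken from the post.

```python
import numpy as np

# Synthetic stand-in for the dataset (assumption: the original data is not available).
rng = np.random.default_rng(0)
true_mean = np.array([0.0, 0.0])
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=5000)

# Fit the 2-D normal: sample mean and sample covariance.
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)  # rows are observations; divides by n - 1


def density(points, mu, sigma):
    """Evaluate the fitted 2-D normal density at each row of `points`."""
    diff = np.atleast_2d(points) - mu
    inv = np.linalg.inv(sigma)
    # Squared Mahalanobis distance of each point from the mean.
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
    # Normalizing constant for dimension k = 2: (2*pi)^(k/2) * |sigma|^(1/2).
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * d2) / norm
```

Evaluating `density` on a grid of points is one way to draw the contours of the fitted distribution.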

The probability contours would look like this:

Since we usually treat data outside the 95% confidence level as outliers, we can set the boundary between non-outliers and outliers as the elliptical density contour that encloses 95% of the probability mass, leaving a total probability of 0.05 outside it.
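One way to sketch this boundary in code relies on a standard fact: for a 2-D normal, the squared Mahalanobis distance of a point from the mean follows a chi-square distribution with 2 degrees of freedom, whose CDF is $1 - e^{-x/2}$. The ellipse enclosing 95% of the mass is therefore the set of points with squared distance $-2\ln(0.05) \approx 5.99$. The data and the names `mahalanobis_sq`, `alpha`, and `threshold` below are assumptions for illustration.

```python
import numpy as np


def mahalanobis_sq(X, mu, sigma):
    """Squared Mahalanobis distance of each row of X from mu."""
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)


# Synthetic data (assumption) and the fitted parameters.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=10000)
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)

alpha = 0.05
# Chi-square quantile with 2 degrees of freedom at level 1 - alpha.
# The closed form below is valid only in two dimensions.
threshold = -2.0 * np.log(alpha)

d2 = mahalanobis_sq(X, mu, sigma)
is_outlier = d2 > threshold          # True for points outside the 95% ellipse
outlier_fraction = is_outlier.mean() # should be close to alpha for normal data
```

In higher dimensions the closed form no longer applies; `scipy.stats.chi2.ppf(1 - alpha, df=k)` gives the corresponding threshold for `k` variables.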

Note that the $p$-value is not commonly used here, whether for purposes like ours or for hypothesis tests. The $p$-value corresponding to a coordinate $(T_1, T_2)$ in the two-dimensional normal distribution is defined as follows (source: here):

$p$-value in the two-dimensional normal distribution. Note that the rectangular region extends to infinity.

The $p$-value in this situation is not particularly meaningful. What is commonly used in this situation is something like this:

where the measure of the orange region is now the probability that a random point is "less likely" (i.e. has a lower probability density) than $(T_1, T_2)$.
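This "less likely" probability can be sketched as follows. Because the density of a normal distribution decreases as the squared Mahalanobis distance $d^2$ grows, a random point has lower density than $(T_1, T_2)$ exactly when its $d^2$ is larger, and in two dimensions that probability is $e^{-d^2/2}$. The parameters `mu` and `sigma`, the test point `t`, and the function name `tail_probability` are assumptions chosen for illustration; a Monte Carlo estimate is included as a sanity check.

```python
import numpy as np

# Assumed parameters of the fitted 2-D normal (illustrative values).
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
sigma_inv = np.linalg.inv(sigma)


def tail_probability(point):
    """Probability that a random draw has lower density than `point` (2-D only)."""
    diff = np.asarray(point, dtype=float) - mu
    d2 = diff @ sigma_inv @ diff      # squared Mahalanobis distance
    return float(np.exp(-0.5 * d2))   # chi-square(2) survival function


# Monte Carlo check: fraction of samples lying farther out than the point t.
rng = np.random.default_rng(2)
samples = rng.multivariate_normal(mu, sigma, size=200000)
diffs = samples - mu
d2_samples = np.einsum("ij,jk,ik->i", diffs, sigma_inv, diffs)

t = np.array([1.5, -1.0])             # hypothetical point (T_1, T_2)
d2_t = (t - mu) @ sigma_inv @ (t - mu)
mc_estimate = float(np.mean(d2_samples > d2_t))
```

A point would be flagged as an outlier at the 95% level when `tail_probability(point)` falls below 0.05, which matches the elliptical boundary described earlier.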

Source: here


Daesoo Lee's Blog
