Outlier Detection with Multivariate Normal Distribution
Data/Machine learning, 2021. 1. 8. 13:28
Suppose we have a dataset with two random variables that follow the normal distribution:
[Figure: Dataset where the two random variables follow the normal distribution]
We can then fit a 2-dimensional normal distribution to this dataset:
[Figure: Multivariate normal distribution]
The probability contours look like this:
[Figure: Probability contour]
Since we usually treat data outside the 95% confidence level as outliers, we can set the boundary between non-outliers and outliers to be the contour ellipse that encloses 95% of the probability mass, so that a point drawn from the fitted distribution falls outside it with probability 0.05.
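The 95% boundary described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the figures: the synthetic data, its parameters, and the helper names (`fit_gaussian`, `squared_mahalanobis`, `flag_outliers_2d`) are all assumptions made for the example. It relies on the fact that, for a 2-dimensional normal distribution, the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, and it treats the fitted mean and covariance as if they were the true parameters.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from the rows of X."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # unbiased (n - 1) estimate
    return mean, cov

def squared_mahalanobis(X, mean, cov):
    """Squared Mahalanobis distance of each row of X from the mean."""
    diff = X - mean
    # Solve cov @ z = diff.T rather than forming the inverse explicitly.
    return np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))

def flag_outliers_2d(X, alpha=0.05):
    """Flag rows outside the ellipse that encloses 1 - alpha of the probability."""
    mean, cov = fit_gaussian(X)
    d2 = squared_mahalanobis(X, mean, cov)
    # In 2 dimensions the chi-square CDF is 1 - exp(-x / 2),
    # so the (1 - alpha) quantile is -2 * ln(alpha), about 5.99 for alpha = 0.05.
    threshold = -2.0 * np.log(alpha)
    return d2 > threshold

# Synthetic data standing in for the dataset in the post (assumed parameters).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=5000)

is_outlier = flag_outliers_2d(X)
outlier_fraction = is_outlier.mean()  # should be close to alpha
```

The closed-form threshold only holds for two variables; with more dimensions the quantile would have to come from a chi-square distribution with the matching degrees of freedom.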
Note that the $p$-value is not commonly used here, either for purposes like ours or for hypothesis tests. The $p$-value corresponding to a coordinate $(T_1, T_2)$ in the two-dimensional normal distribution is defined as follows (source: here):
[Figure: $p$-value in the two-dimensional normal distribution. Note that the rectangular region extends infinitely]
The $p$-value in this situation is not particularly meaningful. What is commonly used instead is something like this:
[Figure]
where the measure of the orange region is the probability that a random point would have been "less likely" (i.e. have a lower probability density) than $(T_1, T_2)$.
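As a rough sketch of this "less likely" probability: in two dimensions, a point has lower density than $(T_1, T_2)$ exactly when its squared Mahalanobis distance $d^2$ is larger, and since $d^2$ follows a chi-square distribution with 2 degrees of freedom, the probability is $\exp(-d^2/2)$. The code below compares that closed form with a Monte Carlo estimate. The mean, covariance, test point, and function names are illustrative assumptions, not values from the post.

```python
import numpy as np

def gaussian_pdf_2d(points, mean, cov):
    """Density of a 2-dimensional normal distribution at each row of points."""
    diff = np.atleast_2d(points) - mean
    d2 = np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-d2 / 2.0) / norm

def density_tail_probability_2d(point, mean, cov):
    """Probability that a random draw has lower density than `point` (2-D only)."""
    diff = np.asarray(point, dtype=float) - np.asarray(mean, dtype=float)
    d2 = float(diff @ np.linalg.solve(cov, diff))
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    return float(np.exp(-d2 / 2.0))

# Assumed parameters and test point, chosen only for illustration.
mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.8], [0.8, 1.0]])
T = np.array([2.0, -1.0])

closed_form = density_tail_probability_2d(T, mean, cov)

# Monte Carlo check: fraction of random draws whose density is below that of T.
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean, cov, size=200_000)
monte_carlo = float(np.mean(gaussian_pdf_2d(samples, mean, cov) < gaussian_pdf_2d(T, mean, cov)))
```

Under this measure, a point is outside the 95% ellipse exactly when the returned probability is below 0.05, so the same number can serve both as a score and as the outlier test.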
Source: here