  • Outlier Detection with Multivariate Normal Distribution
    Data/Machine learning 2021. 1. 8. 13:28

    Suppose we have a dataset with two random variables that follow a normal distribution:

    Dataset where the two random variables follow the normal distribution

    Then, we can build a 2-dimensional normal distribution that fits the above dataset:

    Multivariate normal distribution
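As a minimal sketch of that fitting step, the snippet below estimates the mean vector and covariance matrix from the data and evaluates the resulting density. The function names (`fit_bivariate_normal`, `bivariate_normal_pdf`) and the synthetic dataset are assumptions made for illustration; they are not taken from the original post, whose data is not available here.

```python
import math
import random

def fit_bivariate_normal(points):
    """Return (mean, cov) estimated from a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Unbiased sample covariance (divide by n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def bivariate_normal_pdf(point, mean, cov):
    """Density of the 2-D normal N(mean, cov) at `point`."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * d2) / (2 * math.pi * math.sqrt(det))

# Synthetic stand-in for the dataset (assumed, not the post's data).
rng = random.Random(0)
data = []
for _ in range(2000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append((1.0 + 2.0 * z1, -1.0 + 1.0 * z1 + 0.5 * z2))  # correlated pair

mean, cov = fit_bivariate_normal(data)
print("mean:", mean)
print("cov:", cov)
print("density at the mean:", bivariate_normal_pdf(mean, mean, cov))
```

Only the standard library is used so the sketch runs anywhere. With NumPy, `numpy.mean` and `numpy.cov` would give the same estimates in two lines.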

    The probability contours look like this:

    Probability contour

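    Since we usually treat data outside the 95% confidence level as outliers, we can set the boundary between non-outliers and outliers to be the ellipse (a contour of constant probability density) that contains 95% of the probability mass, so that a point drawn from the distribution falls outside it with probability 0.05.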
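One way to sketch this boundary test is through the squared Mahalanobis distance $d^2$. For a two-dimensional normal distribution with known parameters, $d^2$ follows a chi-square distribution with 2 degrees of freedom, whose tail probability is $e^{-d^2/2}$, so the 0.05 ellipse sits at $d^2 = -2\ln 0.05 \approx 5.991$. The helper names and the example mean and covariance below are assumptions for illustration. When the parameters are estimated from a small sample, the threshold is only approximate.

```python
import math

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance matrix."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def is_outlier(point, mean, cov, alpha=0.05):
    """True if `point` lies outside the ellipse holding 1 - alpha of the mass."""
    # Chi-square quantile with 2 degrees of freedom has this closed form.
    threshold = -2.0 * math.log(alpha)  # about 5.991 for alpha = 0.05
    return mahalanobis_sq(point, mean, cov) > threshold

# Hypothetical fitted parameters (assumed for the example).
mean = (0.0, 0.0)
cov = ((4.0, 2.0), (2.0, 1.25))

print(is_outlier((1.0, 0.5), mean, cov))   # d^2 = 0.25 -> False
print(is_outlier((4.0, -2.0), mean, cov))  # d^2 = 68   -> True
```

The closed-form threshold holds only in two dimensions. In higher dimensions the chi-square quantile would have to come from a statistics library.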

    Note that the $p$-value is not commonly used here, either for purposes like ours or for hypothesis tests. The $p$-value corresponding to a coordinate $(T_1, T_2)$ in the two-dimensional normal distribution is defined as follows (source: here):

    $p$-value in the two-dimensional normal distribution. Note that the rectangular region extends infinitely

    The $p$-value in this situation is not particularly meaningful. What is commonly used instead is something like this:

    where the measure of the orange region is now the probability that a random point would be "less likely" (i.e., have a lower probability density) than $(T_1, T_2)$.
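The measure of that region can be sketched numerically. A point has lower density than $(T_1, T_2)$ exactly when its Mahalanobis distance is larger, so in two dimensions the probability reduces to $e^{-d^2/2}$, where $d^2$ is the squared Mahalanobis distance of $(T_1, T_2)$. The function names, the example parameters, and the Monte Carlo cross-check below are assumptions added for illustration, not part of the original post.

```python
import math
import random

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance matrix."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def lower_density_probability(point, mean, cov):
    """P(random draw has lower density than `point`); valid in 2 dimensions only."""
    return math.exp(-0.5 * mahalanobis_sq(point, mean, cov))

def monte_carlo_estimate(point, mean, cov, n=20000, seed=0):
    """Fraction of simulated draws that lie farther out than `point`."""
    (sxx, sxy), (_, syy) = cov
    # Cholesky factor of the 2x2 covariance, used to correlate standard normals.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    rng = random.Random(seed)
    d2_point = mahalanobis_sq(point, mean, cov)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        draw = (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)
        # Lower density is equivalent to a larger Mahalanobis distance.
        if mahalanobis_sq(draw, mean, cov) > d2_point:
            hits += 1
    return hits / n

# Hypothetical fitted parameters and query point (assumed for the example).
mean = (0.0, 0.0)
cov = ((4.0, 2.0), (2.0, 1.25))
exact = lower_density_probability((2.0, 2.0), mean, cov)  # exp(-2.5), about 0.082
approx = monte_carlo_estimate((2.0, 2.0), mean, cov)
print(exact, approx)
```

A point is then flagged as an outlier when this probability falls below 0.05, which is the same rule as testing whether it lies outside the 95% ellipse.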

     

    Source: here

     
