-
Problems with Clustering Mixed Data | Data/Machine learning | 2021. 1. 10. 08:26
Mixed data: data in which numeric and categorical variables coexist. Problem statement: K-means and PCA are most appropriate for continuous variables. For smaller datasets, it is better to use hierarchical clustering with Gower's distance. In principle, there is no reason why K-means can't be applied to binary or categorical data. You would usually use the "one hot encoder" representation..
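To make the "one hot encoder" idea concrete, here is a minimal sketch in plain Python. The toy columns `ages` and `colors` and the helpers `one_hot` and `min_max` are hypothetical names made up for illustration. Min-max scaling the numeric column is an assumption added here so that both kinds of columns live on a comparable 0 to 1 scale.

```python
def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def min_max(values):
    """Rescale a numeric column to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical toy data: one numeric and one categorical variable.
ages = [25, 40, 31, 58]
colors = ["red", "blue", "red", "green"]

categories, color_cols = one_hot(colors)
age_col = min_max(ages)

# Each record becomes a purely numeric vector: [scaled age, blue, green, red].
features = [[a] + c for a, c in zip(age_col, color_cols)]
```

A matrix like `features` could then be passed to a K-means implementation. Note that Euclidean distance on indicator columns treats every pair of distinct categories as equally far apart, which may or may not suit the data.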
-
Gower's Distance | Data/Machine learning | 2021. 1. 8. 17:16
Gower's distance measures the dissimilarity between records in a mixed dataset, that is, a dataset in which numeric features (variables) and categorical features (variables) exist together. The basic idea is to apply a different distance metric to each variable depending on its data type: For numeric variables and ordered factors (ordered categorical variabl..
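The per-variable idea can be sketched in a few lines of plain Python. This sketch assumes the commonly used form of Gower's distance: a range-normalized absolute difference for numeric variables, simple matching (0 if equal, 1 otherwise) for unordered categorical variables, and an unweighted average across variables. Ordered factors are not handled here, and the function name `gower_distance` and the toy records are hypothetical.

```python
def gower_distance(a, b, is_numeric, ranges):
    """Average per-variable dissimilarity between two records a and b.

    is_numeric[k] says whether variable k is numeric; ranges[k] is the
    observed range (max - min) of variable k and is ignored for categoricals.
    """
    total = 0.0
    for x, y, numeric, r in zip(a, b, is_numeric, ranges):
        if numeric:
            # Range-normalized Manhattan distance, always in [0, 1].
            total += abs(x - y) / r if r else 0.0
        else:
            # Simple matching: 0 when the categories agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical records: (age, income, color, owner).
records = [
    (25, 30000, "red", "yes"),
    (45, 50000, "blue", "yes"),
    (35, 70000, "red", "no"),
]
is_numeric = [True, True, False, False]

# Observed range of each numeric column; None for categorical columns.
ranges = [
    max(r[k] for r in records) - min(r[k] for r in records) if num else None
    for k, num in enumerate(is_numeric)
]

d01 = gower_distance(records[0], records[1], is_numeric, ranges)
d02 = gower_distance(records[0], records[2], is_numeric, ranges)
d12 = gower_distance(records[1], records[2], is_numeric, ranges)
```

Because every per-variable term lies between 0 and 1, the averaged distance does too, so numeric and categorical variables contribute on the same scale.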
-
Outlier Detection with Multivariate Normal Distribution | Data/Machine learning | 2021. 1. 8. 13:28
Suppose we have a dataset with two random variables that follow the normal distribution. We can then fit a 2-dimensional normal distribution to the dataset and draw its probability contours. Since we usually consider data outside of the 95% confidence level as an outlier, we can set the boundary between non-outliers and outliers as an ellipse with the probability o..
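One common way to turn the 95% ellipse into a test is through the squared Mahalanobis distance. The sketch below assumes NumPy is available and uses synthetic data, so `mean_true`, `cov_true`, and the sample size are made up for illustration. It relies on the fact that, for a 2-dimensional normal distribution, the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, whose quantile at probability p has the closed form -2 ln(1 - p).

```python
import numpy as np

# Synthetic 2-D normal data (hypothetical parameters, for illustration only).
rng = np.random.default_rng(0)
mean_true = np.array([0.0, 0.0])
cov_true = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
X = rng.multivariate_normal(mean_true, cov_true, size=2000)

# Fit the 2-dimensional normal distribution: sample mean and covariance.
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

# Squared Mahalanobis distance of every point from the fitted mean.
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)

# 95% boundary: chi-square quantile with 2 degrees of freedom, about 5.99.
threshold = -2.0 * np.log(1.0 - 0.95)

# Points outside the 95% ellipse are flagged as outliers.
is_outlier = d2 > threshold
```

With data that really is normal, roughly 5% of points fall outside the ellipse by construction, so the flagged points are candidates to inspect rather than confirmed anomalies. In more than two dimensions the closed-form threshold no longer applies and a chi-square quantile function would be needed instead.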