Daesoo Lee's Blog

Problems with Clustering Mixed Data

Data/Machine learning 2021. 1. 10. 08:26

Mixed data: data in which numeric and categorical variables coexist. Problem Statement: K-means and PCA are most appropriate for continuous variables. For smaller datasets, it is better to use hierarchical clustering with Gower's distance. In principle, there is no reason why K-means can't be applied to binary or categorical data. You would usually use the "one hot encoder" representation..
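
As a rough illustration of the one-hot representation mentioned in this excerpt, here is a minimal Python sketch. The names `one_hot_encode` and `squared_euclidean` and the sample colors are hypothetical and not taken from the post; a real pipeline would more likely use a library encoder.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))            # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                       # exactly one column is "hot"
        encoded.append(row)
    return categories, encoded


def squared_euclidean(a, b):
    """Squared Euclidean distance, the quantity K-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical sample column
categories, encoded = one_hot_encode(["red", "blue", "red", "green"])
```

Under this encoding, any two records with different categories sit at the same squared Euclidean distance (2), and records with the same category sit at distance 0, which is all that K-means can see of a categorical variable.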

Gower's Distance

Data/Machine learning 2021. 1. 8. 17:16

Gower's distance measures the dissimilarity between records in a mixed dataset, that is, a dataset in which numeric features (variables) and categorical features (variables) exist together. The basic idea behind Gower's distance is to apply a different distance metric to each variable depending on the type of data: For numeric variables and ordered factors (ordered categorical variabl..
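
The per-variable idea can be sketched in a few lines of Python. This sketch assumes the commonly used form of Gower's distance: range-normalized absolute difference for numeric features, simple 0/1 mismatch for categorical features, and equal weights for all features. The excerpt is cut off before it states its exact metrics, so treat these choices, the function name `gower_distance`, and the sample records as assumptions.

```python
def gower_distance(x, y, ranges):
    """Average per-feature dissimilarity between two records.

    ranges[k] is the observed range (max - min) of feature k when it is
    numeric, or None when feature k is categorical.
    """
    total = 0.0
    for xk, yk, rk in zip(x, y, ranges):
        if rk is None:
            # Categorical: 0 if the values match, 1 otherwise
            total += 0.0 if xk == yk else 1.0
        elif rk > 0:
            # Numeric: absolute difference scaled into [0, 1] by the range
            total += abs(xk - yk) / rk
        # A numeric feature with zero range contributes nothing
    return total / len(ranges)


# Hypothetical records: (age, income, favorite color)
records = [
    (25, 50000.0, "red"),
    (35, 60000.0, "blue"),
    (45, 50000.0, "red"),
]
ranges = [20, 10000.0, None]  # ranges of age and income; color is categorical

d01 = gower_distance(records[0], records[1], ranges)
d02 = gower_distance(records[0], records[2], ranges)
```

Because every per-feature term lies in [0, 1], the result also lies in [0, 1], so no single variable dominates just because of its units.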

Happiness lies within our mind

Daily lives 2021. 1. 8. 13:41

Up to how we perceive the world.

Outlier Detection with Multivariate Normal Distribution

Data/Machine learning 2021. 1. 8. 13:28

Let's say we have a dataset with two random variables that follow the normal distribution: Then, we can build a 2-dimensional normal distribution that fits the above dataset: The probability contour would look like this: Since we usually consider data outside of the 95% confidence level as an outlier, we can set the boundary between non-outliers and outliers as an ellipse with the probability o..
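
One way to realize the 95% ellipse described in this excerpt is through the squared Mahalanobis distance, which follows a chi-square distribution with 2 degrees of freedom for a 2-dimensional normal. The sketch below is an assumption about how this could be done, not the post's own code; the function names and the synthetic data are hypothetical. It relies on the closed form of the chi-square quantile for 2 degrees of freedom, -2 ln(1 - p).

```python
import math
import random


def fit_gaussian_2d(points):
    """Estimate the mean and the sample covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance using the 2x2 inverse in closed form."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_threshold_2d(level=0.95):
    """Chi-square quantile for 2 degrees of freedom: -2 * ln(1 - level)."""
    return -2.0 * math.log(1.0 - level)


def is_outlier(p, mean, cov, level=0.95):
    """True if p falls outside the ellipse containing `level` of the mass."""
    return mahalanobis_sq(p, mean, cov) > chi2_threshold_2d(level)


# Synthetic data standing in for the post's dataset (assumption)
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(500)]
mean, cov = fit_gaussian_2d(points)
threshold = chi2_threshold_2d(0.95)
```

Points whose squared Mahalanobis distance exceeds `threshold` lie outside the 95% ellipse and would be flagged as outliers under this rule.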
