Problems with Clustering Mixed Data

Data/Machine learning 2021. 1. 10. 08:26

Mixed data: data in which numeric and categorical variables coexist.

Problem Statement

K-means and PCA are most appropriate for continuous variables. In principle, there is no reason why K-means can't be applied to binary or categorical data, usually through a one-hot encoded representation. In practice, however, using K-means and PCA with binary data can be difficult. For smaller datasets, it is better to use hierarchical clustering with Gower's distance.
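To make the Gower's distance idea concrete, here is a minimal sketch in plain Python. It assumes the common equal-weight form of the distance: numeric features contribute their absolute difference divided by the feature's range, categorical features contribute 0 on a match and 1 otherwise, and the per-feature values are averaged. The `loans` records, the field names, and the `gower_distance` helper are all hypothetical and only serve the illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records (dicts).

    Numeric features: |a - b| divided by the feature's observed range.
    Categorical features: 0.0 if the values match, 1.0 otherwise.
    """
    parts = []
    for key in a:
        if key in numeric_ranges:
            parts.append(abs(a[key] - b[key]) / numeric_ranges[key])
        else:
            parts.append(0.0 if a[key] == b[key] else 1.0)
    return sum(parts) / len(parts)


# Hypothetical mixed records: two numeric fields and one categorical field.
loans = [
    {"income": 30000, "age": 25, "home": "rent"},
    {"income": 90000, "age": 45, "home": "own"},
    {"income": 60000, "age": 35, "home": "rent"},
]

numeric_keys = ["income", "age"]
# Ranges are taken from the data; a constant column (range 0) would need special handling.
numeric_ranges = {
    k: max(r[k] for r in loans) - min(r[k] for r in loans) for k in numeric_keys
}

# Full pairwise distance matrix (symmetric, zeros on the diagonal).
dist = [[gower_distance(a, b, numeric_ranges) for b in loans] for a in loans]

for row in dist:
    print([round(d, 3) for d in row])
```

A matrix like `dist` is the kind of input a hierarchical clustering routine can work from, since such routines only need pairwise dissimilarities rather than coordinates.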

If standard z-scores are used, the binary variables will dominate the definition of the clusters. This is because 0/1 variables take on only two values, so K-means can obtain a small within-cluster sum of squares by assigning all the records with a 0 or a 1 to a single cluster.
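The effect can be checked by hand without running K-means at all. The sketch below, under made-up assumptions, builds a bell-shaped numeric feature (`amount`) and an unrelated, balanced 0/1 feature (`flag`), z-scores both, and then compares the within-cluster sum of squares (the quantity K-means minimizes) for two candidate 2-cluster partitions: one that splits on the binary flag and one that splits the numeric feature at its median. The data and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def wss(points, labels):
    """Within-cluster sum of squares for a given cluster assignment."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [mean(dim) for dim in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total


n = 200
# Hypothetical data: evenly spaced normal quantiles (sorted ascending)
# and a balanced 0/1 flag that is unrelated to the numeric feature.
amount = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
flag = [i % 2 for i in range(n)]

points = list(zip(zscore(amount), zscore(flag)))

split_on_flag = flag                                    # cluster = value of the 0/1 variable
split_on_amount = [int(i >= n // 2) for i in range(n)]  # cluster = below/above the median

wss_flag = wss(points, split_on_flag)
wss_amount = wss(points, split_on_amount)

print("WSS when splitting on the binary flag:    ", round(wss_flag, 1))
print("WSS when splitting on the numeric feature:", round(wss_amount, 1))
```

In this toy setup the split on the binary flag gives the lower within-cluster sum of squares, because it removes all of the flag's variance at once while the median split removes only part of the numeric feature's variance. An optimizer that minimizes this objective would therefore prefer the partition defined by the 0/1 variable.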

Solutions

To avoid this behavior:

You could scale the binary variables to have a smaller variance than the other variables.
Alternatively, for very large datasets, you could apply clustering separately to subsets of the data defined by specific categorical values. For example, you could cluster separately the loans made to someone who has a mortgage, owns a home outright, or rents.
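The first option can be sketched as follows. The idea is to z-score the binary variable and then multiply it by a shrink factor so that it carries less variance than the continuous variables. The factor `BINARY_WEIGHT = 0.3` is an arbitrary assumption for illustration, not a recommended value, and the data and helper names are hypothetical. As a check, the code compares the within-cluster sum of squares of a split on the binary flag against a median split on the numeric feature.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev

BINARY_WEIGHT = 0.3  # hypothetical shrink factor; in practice this would need tuning


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def wss(points, labels):
    """Within-cluster sum of squares for a given cluster assignment."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [mean(dim) for dim in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total


n = 200
# Hypothetical data: sorted normal quantiles and an unrelated balanced 0/1 flag.
amount = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
flag = [i % 2 for i in range(n)]

# Down-weight the binary column after standardizing it.
flag_scaled = [BINARY_WEIGHT * z for z in zscore(flag)]
points = list(zip(zscore(amount), flag_scaled))

wss_flag = wss(points, flag)                                    # split on the 0/1 variable
wss_amount = wss(points, [int(i >= n // 2) for i in range(n)])  # split at the median

print("WSS when splitting on the binary flag:    ", round(wss_flag, 1))
print("WSS when splitting on the numeric feature:", round(wss_amount, 1))
```

With the binary column shrunk, the split on the numeric feature now has the lower within-cluster sum of squares in this toy setup, so the 0/1 variable no longer dictates the partition. How far to shrink is a judgment call that depends on how much the categorical information should matter.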


Daesoo Lee's Blog
