  • Problems with Clustering Mixed Data
    Data/Machine learning 2021. 1. 10. 08:26

    Mixed data: data in which numeric and categorical variables coexist.

    Problem Statement

    K-means and PCA are most appropriate for continuous variables. For smaller mixed data sets, it is better to use hierarchical clustering with Gower's distance. In principle, there is no reason why K-means can't be applied to binary or categorical data; you would usually use the one-hot encoded representation. In practice, however, using K-means and PCA with binary data can be difficult.
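
    To make Gower's distance concrete, here is a minimal sketch in plain Python. It assumes the common definition: a range-normalized absolute difference for numeric features, simple matching (0 if equal, 1 otherwise) for categorical features, averaged over all features. The function name, the `numeric_ranges` argument, and the loan records are hypothetical and only for illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records.

    numeric_ranges maps a feature index to that feature's range (max - min)
    over the data set; any index not in it is treated as categorical.
    """
    parts = []
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            # Numeric feature: absolute difference scaled to [0, 1]
            parts.append(abs(x - y) / numeric_ranges[i])
        else:
            # Categorical feature: 0 if the values match, 1 otherwise
            parts.append(0.0 if x == y else 1.0)
    return sum(parts) / len(parts)


# Hypothetical records: (loan amount, home ownership)
records = [(1000.0, "RENT"), (3000.0, "RENT"), (5000.0, "OWN")]
amounts = [r[0] for r in records]
ranges = {0: max(amounts) - min(amounts)}

d01 = gower_distance(records[0], records[1], ranges)  # (0.5 + 0) / 2 = 0.25
d02 = gower_distance(records[0], records[2], ranges)  # (1.0 + 1) / 2 = 1.0
```

    Every feature contributes a value between 0 and 1, so no single variable type dominates the distance. A matrix of these pairwise distances is what a hierarchical clustering routine would take as input.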

    If standard z-scores are used, the binary variables will dominate the definition of the clusters. This is because 0/1 variables take on only two values, so K-means can obtain a small within-cluster sum of squares by assigning all the records with the same value (0 or 1) to a single cluster.
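
    The effect can be checked by hand on a toy example. The sketch below z-scores one continuous feature and one 0/1 flag, then compares the within-cluster sum of squares (WCSS) of two candidate two-cluster partitions: one that splits on the flag and one that splits the continuous feature at its midpoint. The data and all names (`amount`, `flag`, `wcss`) are made up for illustration.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def wcss(rows, labels):
    """Sum of squared distances from each row to its cluster centroid."""
    total = 0.0
    for label in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == label]
        centroid = [mean(col) for col in zip(*members)]
        total += sum((x - c) ** 2 for r in members for x, c in zip(r, centroid))
    return total


# Hypothetical toy data: a continuous feature and an unrelated 0/1 flag
amount = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
flag = [0, 1, 0, 1, 0, 1, 0, 1]

rows = list(zip(zscore(amount), zscore(flag)))

split_on_flag = flag                                      # cluster = flag value
split_on_amount = [0 if a <= 4.0 else 1 for a in amount]  # low vs. high amounts

wcss_flag = wcss(rows, split_on_flag)      # about 7.62
wcss_amount = wcss(rows, split_on_amount)  # about 9.90
print(wcss_flag, wcss_amount)
```

    Splitting on the flag removes all of the flag's variance at once, while splitting the continuous feature removes only part of its variance. The flag-based partition therefore has the lower WCSS, which is the quantity K-means minimizes, even though the flag carries no information about the amounts here.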

    Solutions

    To avoid this behavior,

    • you could scale the binary variables to have a smaller variance than other variables.
    • Alternatively, for very large data sets, you could apply clustering to different subsets of data taking on specific categorical values. For example, you could apply clustering separately to those loans made to someone who has a mortgage, owns a home outright, or rents.
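
    Both options can be sketched with a tiny, deterministic K-means written in plain Python. Everything here is an assumption for illustration: the `kmeans` helper (Lloyd's algorithm seeded with the first k points), the weight of 0.25, and the loan records. In real work you would use a library implementation and a far larger data set.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def kmeans(points, k, iters=20):
    """Tiny Lloyd's algorithm seeded with the first k points (illustration only)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


# Hypothetical loan records: (amount, has-mortgage flag, home ownership)
loans = [
    (1.0, 0, "RENT"), (8.0, 1, "MORTGAGE"), (2.0, 1, "MORTGAGE"), (3.0, 0, "OWN"),
    (4.0, 1, "MORTGAGE"), (5.0, 0, "RENT"), (6.0, 1, "MORTGAGE"), (7.0, 0, "OWN"),
]

# Option 1: shrink the z-scored binary column so it has a smaller variance
BINARY_WEIGHT = 0.25  # assumed value; tune it for your own data
z_amount = zscore([a for a, _, _ in loans])
z_flag = [BINARY_WEIGHT * z for z in zscore([f for _, f, _ in loans])]
weighted_labels = kmeans(list(zip(z_amount, z_flag)), k=2)

# Option 2: cluster each home-ownership subset separately on the amount alone
by_ownership = defaultdict(list)
for amount, _, ownership in loans:
    by_ownership[ownership].append((amount,))
subset_labels = {own: kmeans(pts, k=2) for own, pts in by_ownership.items()}
```

    With the flag down-weighted, the two clusters in `weighted_labels` separate low amounts from high amounts instead of following the flag. In the second option the categorical variable is constant within each subset, so it drops out of the distance calculation entirely.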
