-
Problems with Clustering Mixed Data · Data/Machine learning · 2021. 1. 10. 08:26
Mixed data: data where numeric and categorical variables coexist. Problem statement: K-means and PCA are most appropriate for continuous variables. For smaller datasets, it is better to use hierarchical clustering with Gower's distance. In principle there is no reason why K-means can't be applied to binary or categorical data; you would usually use the "one-hot encoder" representation..
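As a concrete illustration of the one-hot representation mentioned in the excerpt, here is a minimal sketch (the helper `one_hot` and the example values are illustrative, not from the post):

```python
import numpy as np

def one_hot(values):
    """One-hot encode a list of categorical values into a 0/1 matrix.
    Columns follow the sorted order of the distinct categories."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    mat = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        mat[row, index[v]] = 1.0
    return categories, mat

# each record becomes a row of 0s with a single 1 in its category's column
cats, encoded = one_hot(["red", "blue", "red", "green"])
```

The resulting 0/1 matrix can then be fed to K-means like any numeric data, which is the workaround the excerpt alludes to.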
-
Gower's Distance · Data/Machine learning · 2021. 1. 8. 17:16
It measures the dissimilarity between records within a mixed dataset. A mixed dataset is one where numeric features (variables) and categorical features (variables) exist together. The basic idea behind Gower's distance is to apply a different distance metric to each variable depending on the type of data: for numeric variables and ordered factors (ordered categorical variabl..
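A minimal sketch of the per-variable idea described above, for two records only (the function name, the range-normalized metric for numeric features, and the 0/1 mismatch for categorical ones are the common textbook form; details of the post's own version are cut off):

```python
import numpy as np

def gower_distance(a, b, numeric_mask, ranges):
    """Gower's distance between two records of mixed types.
    numeric_mask[j] is True for numeric features; ranges[j] is the
    feature's range (max - min) over the whole dataset."""
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if numeric_mask[j]:
            # numeric feature: range-normalized absolute difference
            total += abs(x - y) / ranges[j] if ranges[j] else 0.0
        else:
            # categorical feature: 0 if equal, 1 if different
            total += 0.0 if x == y else 1.0
    return total / len(a)

# mixed record: (age, colour); assume age spans a range of 20 in the data
d = gower_distance((30, "red"), (40, "blue"),
                   numeric_mask=[True, False], ranges=[20.0, None])
```

Here the numeric part contributes 10/20 = 0.5 and the categorical mismatch contributes 1, so the average is 0.75.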
-
Outlier Detection with Multivariate Normal Distribution · Data/Machine learning · 2021. 1. 8. 13:28
Let's say we have a dataset with two random variables that each follow a normal distribution. Then we can fit a 2-dimensional normal distribution to that dataset, and plot its probability contours. Since we usually consider data outside of the 95% confidence level as outliers, we can set the boundary between non-outliers and outliers as an ellipse with the probability o..
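A sketch of the idea with squared Mahalanobis distance, which is the standard way to test whether a point lies outside a probability ellipse of a fitted 2-D normal (the synthetic data and the hard-coded chi-square quantile are my assumptions, not values from the post):

```python
import numpy as np

# synthetic 2-D normal data standing in for the post's dataset
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500)

# fit the 2-D normal: sample mean and covariance
mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# squared Mahalanobis distance of each record from the fitted center
diff = data - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# the 95% ellipse corresponds to the chi-square 0.95 quantile with
# 2 degrees of freedom, approximately 5.991
outliers = d2 > 5.991
```

Roughly 5% of the records should fall outside the ellipse, matching the 95% confidence boundary described above.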
-
Hierarchical Clustering (Agglomerative Algorithm) · Data/Machine learning · 2021. 1. 7. 12:05
Hierarchical clustering provides an intuitive graphical display (e.g. a dendrogram). However, it cannot be used with large datasets: even for a modest-sized dataset with just tens of thousands of records, it can require intensive computing resources. In fact, most applications of hierarchical clustering are focused on relatively small datasets. Algorithm (Hierarchical Clustering = Agglomer..
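The agglomerative algorithm the excerpt starts to describe can be sketched as repeated merging of the closest pair of clusters; this toy version uses single linkage and stops at `k` clusters (the post's own linkage choice is cut off, so single linkage is an assumption). Its nested pairwise loop also makes the quadratic cost behind the "large dataset" caveat visible:

```python
import numpy as np

def agglomerative(points, k):
    """Agglomerative clustering sketch: start with each point as its
    own cluster, then repeatedly merge the pair of clusters with the
    smallest single-linkage distance until k clusters remain."""
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: closest pair of points across clusters
                d = min(np.linalg.norm(pts[i] - pts[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

clusters = agglomerative([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```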
-
K-means Implementation in Google Colab · Data/Machine learning · 2021. 1. 6. 17:02
Implementation of K-means in Google Colab to take a close look at the K-means process (link here). Result: the dataset used is as follows. 1. Assign the initial cluster means (by randomly assigning each record to one of the $K$ clusters). Then, the (initial) cluster means are the means of the records assigned to each cluster: $$ \bar{x}_k = \frac{1}{n_k} \sum_{i \in k}^{n_k}{x_i} $$ $$ \bar{..
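The initialization step quoted above (random assignment, then cluster means $\bar{x}_k$) and the subsequent mean/assignment alternation can be sketched as follows; this is my own minimal version, not the notebook's code, and the empty-cluster fallback is an added safeguard:

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    """K-means sketch: randomly assign each record to one of the k
    clusters, then alternate between recomputing each cluster mean
    (the average of its assigned records) and reassigning records
    to the nearest mean."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(pts))  # random initial assignment
    for _ in range(iters):
        means = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j)
            else pts[rng.integers(len(pts))]  # reseed an empty cluster
            for j in range(k)
        ])
        # reassign each record to its nearest cluster mean
        dists = np.linalg.norm(pts[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
    return labels, means

labels, means = kmeans([[0, 0], [0.1, 0], [10, 10], [10, 10.1]], k=2)
```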
-
Ensemble, Bagging, and Random Forest · Data/Machine learning · 2021. 1. 6. 11:46
Bagging (bootstrap aggregating) is one of the ensemble approaches. Ensemble: an ensemble of predictive models averages (or takes majority votes over) the outputs of multiple predictive models. The simple version of an ensemble is as follows: develop a predictive model and record its predictions for a given dataset; repeat for multiple models on the same data; for each record to be predicted, take an average (..
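The bagging idea can be sketched in a few lines: resample the data with replacement, fit a base learner on each resample, and average the predictions. The base learner here is a 1-nearest-neighbour rule on 1-D inputs, purely as an illustrative stand-in (the post's choice of model is not shown in the excerpt):

```python
import numpy as np

def bagging_predict(X, y, x_new, n_models=25, seed=0):
    """Bagging sketch: fit a simple base learner (a 1-nearest-neighbour
    rule, an illustrative stand-in) on each bootstrap resample of the
    training data, then average the n_models predictions for x_new."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        Xb, yb = X[idx], y[idx]
        nearest = np.abs(Xb - x_new).argmin()  # base learner: 1-NN
        preds.append(yb[nearest])
    return float(np.mean(preds))

# targets near x=0..2 are 0, near x=10..12 are 1; a query at 0.2
# should be pulled toward the low group by most bootstrap models
pred = bagging_predict([0, 1, 2, 10, 11, 12], [0, 0, 0, 1, 1, 1], x_new=0.2)
```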