    An Introduction to Statistical Learning with Applications in R

    Notation and Simple Matrix Algebra

    We use $n$ to represent the number of distinct data points, or observations.

    $p$ denotes the number of variables.

    $\boldsymbol{X}$ denotes an $n \times p$ matrix whose $(i, j)$-th element is $x_{ij}$. That is,

    $$ \boldsymbol{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} $$

    A vector of length $n$ is denoted in lower-case bold; e.g.

    $$ \boldsymbol{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} $$

    Matrices are denoted using bold capitals such as $\boldsymbol{A}$.

    Random variables are denoted using a capital normal font such as $A$ regardless of their dimensions.

    To indicate that a vector is of length $k$, we will use $a \in \mathbb{R}^k$.

    To indicate that a matrix is of dimension $r \times s$, we will use $\boldsymbol{A} \in \mathbb{R}^{r \times s}$.

    4.4 Linear Discriminant Analysis (LDA)

    LDA is like PCA, but it focuses on maximizing the separability among known classes. In the following example, a dataset with two classes is separated using $b_0 + b_1 X_1 + b_2 X_2$:

    source: StatQuest

    Using Bayes' Theorem for Classification

    Suppose that we wish to classify an observation into one of $K$ classes, where $K \ge 2$. Let $\pi_{k}$ represent the overall or prior probability that a randomly chosen observation comes from the $k$-th class. Let $f_k(x) \equiv Pr(X=x | Y=k)$ denote the density function of $X$ for an observation that comes from the $k$-th class (refer to here regarding the use of the density function).

    Bayes' Theorem

    Then, Bayes' theorem states that:

    $$ Pr(Y=k | X=x) = \frac{P(X=x | Y=k) P(Y=k)}{P(X=x)} $$

    $$ = \frac{P(X=x, \: Y=k)}{P(X=x)} $$

    $$ = \frac{P(X=x, \: Y=k)}{P(X=x, \: Y=1) + P(X=x, \: Y=2) + \cdots + P(X=x, \: Y=K)} $$

    $$ = \frac{P(X=x, \: Y=k)}{\sum_{l=1}^{K}{P(X=x, \: Y=l)}} $$

    $$ = \frac{P(Y=k)P(X=x | Y=k)}{\sum_{l=1}^{K}{P(Y=l)P(X=x | Y=l)}} $$

    $$ = \frac{\pi_k f_k{(x)}}{\sum_{l=1}^{K}{\pi_l f_l(x)}} \tag{4.10}$$
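
    As a quick numerical illustration of $Eq.(4.10)$, the following R sketch (with made-up priors $\pi_k$ and normal class densities, i.e., purely illustrative values) computes the posterior probabilities for a single observation; the Bayes classifier simply assigns the observation to the class with the largest posterior.

    ```r
    # A minimal sketch of Eq. (4.10): posterior class probabilities from
    # hypothetical priors pi_k and normal class densities f_k(x).
    # All numbers below are made up for illustration.
    priors <- c(0.3, 0.7)   # pi_1, pi_2
    mus    <- c(-1, 2)      # class means
    sigma  <- 1.5           # shared standard deviation
    x      <- 0.5           # the observation to classify

    numerators <- priors * dnorm(x, mean = mus, sd = sigma)  # pi_k * f_k(x)
    posterior  <- numerators / sum(numerators)               # Eq. (4.10)
    posterior
    which.max(posterior)    # the Bayes classifier picks this class
    ```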

    Linear Discriminant Analysis for $p=1$

    Assume that $p=1$, that is, we have only one predictor, and $f_k(x)$ is a normal distribution. In the one-dimensional setting, the normal distribution takes the form:

    $$ f_k(x) = \frac{1}{\sqrt{2 \pi \sigma_k^2}} \exp{\left( -\frac{1}{2 \sigma_k^2} (x-\mu_k)^2 \right)} \tag{4.11}$$

    where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$-th class. For now, let's further assume that $\sigma_1^2 = \cdots = \sigma_K^2$: that is, there is a shared variance term across all $K$ classes, which for simplicity we can denote by $\sigma^2$. Plugging $Eq.(4.11)$ into $Eq.(4.10)$, we find:

    $$ Pr(Y=k | X=x) = p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\left( -\frac{1}{2 \sigma^2} (x-\mu_k)^2 \right)}}{\sum_{l=1}^{K}{\pi_l \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\left( -\frac{1}{2 \sigma^2} (x-\mu_l)^2 \right)}} } \tag{4.12}$$

    The Bayes classifier involves assigning an observation $X=x$ to the class for which $Eq.(4.12)$ is largest. Taking the log of $Eq.(4.12)$ and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which

    $$ \delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log{(\pi_k)} \tag{4.13}$$

    is largest.

    In practice, we have to estimate the parameters $\mu_1, \cdots, \mu_K$, $\pi_1, \cdots, \pi_K$, and $\sigma^2$. The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into $Eq.(4.13)$. In particular, the following estimates are used:

    $$ \hat{\mu}_k = \frac{1}{n_k} \sum_{i:y_i=k}{x_i} \tag{4.15}$$

    $$ \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2 \tag{4.15} $$

    Sometimes, we have knowledge of the class membership probabilities $\pi_1, \cdots, \pi_K$, which can be used directly. In the absence of any additional information, LDA estimates $\pi_k$ using the proportion of the training observations that belong to the $k$-th class. In other words,

    $$ \hat{\pi}_k = n_k / n \tag{4.16}$$

    The LDA classifier plugs the estimates given in $Eq.(4.15)$ and $Eq.(4.16)$ into $Eq.(4.13)$, and assigns an observation $X=x$ to the class for which

    $$ \hat{\delta}_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log{(\hat{\pi}_k)} \tag{4.17}$$

    is largest.
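
    To make the plug-in recipe concrete, here is a minimal R sketch on simulated data (two classes, one predictor) that computes $\hat{\mu}_k$, $\hat{\sigma}^2$, $\hat{\pi}_k$, and the scores $\hat{\delta}_k(x)$ by hand. This is an illustration of $Eqs.(4.15)-(4.17)$, not the book's lab code.

    ```r
    # Toy data: K = 2 classes, p = 1 predictor (simulated for illustration)
    set.seed(1)
    x <- c(rnorm(20, mean = -1), rnorm(20, mean = 2))
    y <- rep(c(1, 2), each = 20)

    n <- length(x)
    K <- 2
    n_k        <- as.vector(table(y))                       # class counts
    mu_hat     <- tapply(x, y, mean)                        # Eq. (4.15)
    pi_hat     <- n_k / n                                   # Eq. (4.16)
    sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)          # Eq. (4.15)

    # Discriminant scores (4.17) for a new observation x0; assign to the largest
    x0 <- 0.5
    delta_hat <- x0 * mu_hat / sigma2_hat - mu_hat^2 / (2 * sigma2_hat) + log(pi_hat)
    which.max(delta_hat)
    ```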

    In short, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance, and plugging estimates for these parameters into the Bayes classifier. To improve the LDA, we can allow the observations in the $k$-th class to have a class-specific variance, $\sigma_k^2$ instead of a common variance, $\sigma^2$, which is Quadratic Discriminant Analysis (QDA).

    Linear Discriminant Analysis for $p > 1$

    We now extend the LDA classifier to the case of multiple predictors. To do this, we will assume that $X = (X_1, X_2, \cdots, X_p)$ is drawn from a multivariate normal distribution with a class-specific mean vector and a common covariance matrix.

    Formally, the multivariate normal distribution is defined as:

    $$ f(\boldsymbol{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2} } \exp{\left( -\frac{1}{2} (\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu}) \right)} $$

    In the case of $p > 1$ predictors, the LDA classifier assumes that the observations in the $k$-th class are drawn from a multivariate normal distribution $N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}_k$ is a class-specific mean vector, and $\boldsymbol{\Sigma}$ is a covariance matrix that is common to all $K$ classes. Plugging the normal distribution function for the $k$-th class, $f_k(X=\boldsymbol{x})$, into $Eq.(4.10)$ and performing a little bit of algebra reveals that the Bayes classifier assigns an observation $X=\boldsymbol{x}$ to the class for which

    $$ \delta_k(\boldsymbol{x}) = \boldsymbol{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log{\pi}_k $$

    is largest. This is the vector/matrix version of $Eq.(4.13)$.
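
    As a short sketch of how this is done in practice, MASS::lda (as in the book's R lab) estimates the class means, the common covariance matrix, and the priors, and classifies via the discriminant scores; the built-in iris data are used here only as a convenient example.

    ```r
    # Multivariate LDA with MASS::lda on the built-in iris data (3 classes, p = 4)
    library(MASS)

    lda_fit <- lda(Species ~ ., data = iris)
    lda_fit$prior                        # estimated pi_k
    head(predict(lda_fit, iris)$class)   # predicted classes on the training data
    ```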

    Quadratic Discriminant Analysis

    The LDA assumes that the observations within each class are drawn from a multivariate normal distribution with a class-specific mean vector and a covariance matrix that is common to all $K$ classes. Quadratic discriminant analysis (QDA) provides an alternative approach. Like the LDA, the QDA classifier results from assuming that the observations from each class are drawn from a normal distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike the LDA, the QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the $k$-th class is of the form $X \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where $\boldsymbol{\Sigma}_k$ is a covariance matrix for the $k$-th class. So the QDA classifier involves plugging estimates of $\boldsymbol{\Sigma}_k$, $\boldsymbol{\mu}_k$, and $\pi_k$ into Bayes' theorem.
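
    A minimal QDA counterpart (same iris example, again just for illustration): MASS::qda fits a separate covariance matrix for each class.

    ```r
    # QDA: a class-specific covariance matrix per class
    library(MASS)

    qda_fit <- qda(Species ~ ., data = iris)
    table(predict(qda_fit, iris)$class, iris$Species)   # training confusion matrix
    ```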

    4.5 Comparison of KNN, LDA, Logistic, and QDA

    According to $Fig. 4.10$ and $Fig. 4.11$ on p.153, when decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.

    7.2 Step Functions

    We break the range of $X$ into bins, and fit a different constant in each bin. This amounts to converting a continuous variable into an ordered categorical variable.

    In greater detail, we create cutpoints $c_1, c_2, \cdots, c_K$ in the range of $X$, and then construct $K+1$ new variables:

    $$ C_0(X) = I(X < c_1) $$

    $$ C_1(X) = I(c_1 \le X < c_2) $$

    $$ C_2(X) = I(c_2 \le X < c_3) $$

    $$ \vdots $$

    $$ C_{K-1}(X) = I(c_{K-1} \le X < c_K) $$

    $$ C_{K}(X) = I(c_{K} \le X) $$

    where $I(.)$ is an indicator function that returns a 1 if the condition is true, and returns a 0 otherwise. We then use least squares to fit a linear model using $C_1(X), C_2(X), \cdots, C_K(X)$ as predictors:

    $$ y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i \tag{7.5} $$ 

    Example of the step function. The dotted lines indicate an estimated 95% confidence interval
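
    A minimal R sketch of $Eq.(7.5)$: cut() bins the predictor into the intervals $C_1(X), \cdots, C_K(X)$ and lm() fits a constant in each bin. This assumes the Wage data from the ISLR package; any continuous predictor/response pair would work the same way.

    ```r
    # Step function fit: a piecewise-constant model via binning
    library(ISLR)

    step_fit <- lm(wage ~ cut(age, 4), data = Wage)   # 4 bins => 3 interior cutpoints
    coef(summary(step_fit))                           # one constant per bin (plus intercept)
    ```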

    7.4 Regression Splines

    Piecewise Polynomials

    Piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of $X$. A piecewise cubic polynomial with a single knot at a point $c$ takes the form:

    $$ y_i = \left\{ \begin{array}{ll} \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \text{if } x_i < c; \\ \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \text{if } x_i \ge c \\ \end{array} \right. $$

    Using more knots leads to a more flexible piecewise polynomial.

    Constraints and Splines

    Various piecewise polynomials. The knot is represented by a dashed line

    Top Left: The cubic polynomials are unconstrained. Top Right: The cubic polynomials are constrained to be continuous at the knot. Bottom Left: The cubic polynomials are constrained to be continuous and to have continuous first and second derivatives. Bottom Right: A linear spline, which is constrained to be continuous.
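
    As a hedged sketch, a cubic regression spline (continuous, with continuous first and second derivatives at the knots) can be fit with splines::bs(); the Wage data from the ISLR package are assumed here.

    ```r
    # Cubic regression spline with knots at ages 25, 40, and 60
    library(ISLR)
    library(splines)

    spline_fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
    age_grid   <- seq(min(Wage$age), max(Wage$age), length.out = 100)
    preds      <- predict(spline_fit, newdata = data.frame(age = age_grid), se.fit = TRUE)
    # preds$fit is the fitted curve; preds$fit +/- 2 * preds$se.fit gives rough confidence bands
    ```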

    7.7 Generalized Additive Models (GAM)

    7.7.1 GAMs for Regression Problems

    A natural way to extend the multiple linear regression model

    $$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i $$

    in order to allow for non-linear relationships between each feature and the response is to replace each linear component $\beta_j x_{ij}$ with a (smooth) non-linear function $f_j(x_{ij})$. We would then write the model as

    $$ y_i = \beta_0 + \sum_{j=1}^{p} f_j (x_{ij}) + \epsilon_i $$

    This is an example of a GAM. It is called an additive model because we calculate a separate $f_j$ for each $X_j$ and then add together all of their contributions. For example,

    $$ wage = \beta_0 + f_1(year) + f_2(age) + f_3(education) + \epsilon $$
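
    A hedged sketch of this wage GAM using smoothing splines via the gam package (mirroring the book's lab; the ISLR Wage data and the gam package are assumed to be available):

    ```r
    # GAM for regression: smooth functions of year and age, plus a step function of education
    library(ISLR)
    library(gam)

    gam_fit <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
    summary(gam_fit)
    # plot(gam_fit, se = TRUE) would draw one fitted function f_j per predictor
    ```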

    7.7.2 GAMs for Classification Problems

    Let $Y$ be a binary variable - 0 or 1, and let $p(X) = Pr(Y=1 | X)$ be the conditional probability (given the predictors) that the response equals one. Recall the logistic regression model which consists of a linear function of the predictors:

    $$ \log{\left( \frac{p(X)}{1-p(X)} \right)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \tag{7.17}$$

    This logit is the log of the odds of $P(Y=1 | X)$ versus $P(Y=0 | X)$. A natural way to extend $Eq.(7.17)$ to allow for non-linear relationships is to use the model,

    $$ \log{\left( \frac{p(X)}{1-p(X)} \right)} = \beta_0 + \beta_1 f_1(X_1) + \beta_2 f_2(X_2) + \cdots + \beta_p f_p(X_p) \tag{7.18}$$
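
    The classification analogue of $Eq.(7.18)$ can be fit as a binomial GAM; the sketch below (same assumptions as above about the Wage data and the gam package) models the event wage > 250.

    ```r
    # Logistic GAM: non-linear in age, linear in year, step function of education
    library(ISLR)
    library(gam)

    logit_gam <- gam(I(wage > 250) ~ year + s(age, 5) + education,
                     family = binomial, data = Wage)
    summary(logit_gam)
    ```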

    8.2 Bagging, Random Forests, Boosting

    8.2.2 Random Forests

    Random forests provide an improvement over bagged trees (bagging = bootstrapping + aggregating) by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors, where $m \approx \sqrt{p}$.

    In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. This may sound crazy, but it has a clever rationale. Suppose that there is one very strong predictor in the dataset, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities.

    Random forests overcome this problem by forcing each split to consider only a subset of the predictors.
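
    A minimal sketch with the randomForest package (the iris data are used only as a convenient example): at each split only mtry $\approx \sqrt{p}$ randomly chosen predictors are considered.

    ```r
    # Random forest: decorrelated trees via a random subset of predictors per split
    library(randomForest)

    set.seed(1)
    p      <- ncol(iris) - 1                        # 4 predictors
    rf_fit <- randomForest(Species ~ ., data = iris,
                           mtry = floor(sqrt(p)),   # ~ sqrt(p) candidates per split
                           ntree = 500)
    rf_fit                                          # OOB error estimate and confusion matrix
    ```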

    9 Support Vector Machines

    The support vector machine is a generalization of a simple and intuitive classifier called the maximal margin classifier.

    9.1 Maximal Margin Classifier

    9.1.1 What is a Hyperplane?

    In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p-1$. For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace - in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace - that is, a plane.

    The mathematical definition of a hyperplane is quite simple. In two dimensions, a hyperplane is defined by the equation:

    $$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0 \tag{9.1}$$

    for parameters $\beta_0$, $\beta_1$, and $\beta_2$.

    Now, suppose that $X$ does not satisfy $Eq.(9.1)$; rather,

    $$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 $$

    Then this tells us that $X$ lies to one side of the hyperplane. On the other hand, if

    $$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 $$

    then $X$ lies on the other side of the hyperplane. So we can think of the hyperplane as dividing $p$-dimensional space into two halves. An example of a hyperplane in two-dimensional space is shown in the following figure:

    The hyperplane $1+2X_1 + 3X_2 = 0$ is shown. The blue region is the set of points for which $1+2X_1+3X_2 > 0$, and the purple region is the set of points for which $1 + 2X_1 + 3X_2 <0$.

    The concept of the hyperplane in two dimensions can easily be generalized to $p$ dimensions by adding terms up to $\beta_p X_p$.
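
    A tiny R sketch of this idea, using the hyperplane $1 + 2X_1 + 3X_2 = 0$ from the figure above: the sign of $f(x)$ tells us which side of the hyperplane a point lies on.

    ```r
    # Which side of the hyperplane 1 + 2*X1 + 3*X2 = 0 do a few points lie on?
    f <- function(x1, x2) 1 + 2 * x1 + 3 * x2

    pts <- data.frame(x1 = c(1, -2, 0), x2 = c(1, 0, -1/3))
    cbind(pts, f = f(pts$x1, pts$x2), side = sign(f(pts$x1, pts$x2)))
    # f > 0: one side of the hyperplane; f < 0: the other side; f = 0: on the hyperplane
    ```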

    9.1.2 Classification Using a Separating Hyperplane

    We will now see a new approach that is based upon the concept of a separating hyperplane.

    Suppose that we label the observations from the blue class as $y_i = 1$ and those from the purple class as $y_i = -1$. Then a separating hyperplane has the property that

    $$ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \quad \text{if } y_i=1 $$

    $$ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \quad \text{if } y_i=-1 $$

    Equivalently, a separating hyperplane has the property that

    $$ y_i ( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} ) > 0 \tag{9.8} $$

    for all $i = 1, \cdots, n$.

    We can also make use of the magnitude of $f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$. If $f(x_i)$ is large, then this means that $x_i$ lies far from the hyperplane, and so we can be confident about our class assignment for $x_i$. On the other hand, if $f(x_i)$ is close to zero, then $x_i$ is located near the hyperplane, and so we are less certain about the class assignment for $x_i$.

    A separating hyperplane with observations $x_i$

    9.1.3 The Maximal Margin Classifier

    In general, if our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations. Three possible separating hyperplanes are shown in the following figure:

    Three possible separating hyperplanes among an infinite number of such hyperplanes

    A method to determine one optimal hyperplane is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations.

    The maximal margin hyperplane is shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two red points and the blue point that lie on the dashed lines are the support vectors.

    Here the margin is the distance from the separating hyperplane to the closest training observation. The maximal margin hyperplane is the separating hyperplane for which the margin is largest.

    If $\beta_0, \beta_1, \cdots, \beta_p$ are the coefficients of the maximal margin hyperplane, then the maximal margin classifier classifies the test observation $x^*$ based on the sign of $f(x^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^* $.

    The support vectors are the data points that lie on the dashed lines in the figure above. They "support" the maximal margin hyperplane in the sense that if these points were moved slightly, then the maximal margin hyperplane would move as well. Interestingly, the maximal margin hyperplane depends directly on the support vectors, but not on the other observations.

    9.1.4 Construction of the Maximal Margin Classifier

    We now consider the task of constructing the maximal margin hyperplane based on a set of $n$ training observations $x_1, \cdots, x_n \in \mathbb{R}^p$ and associated class labels $y_1, \cdots, y_n \in \{ -1, 1 \}$. Briefly, the maximal margin hyperplane is the solution to the optimization problem:

    $$ \underset{\beta_0, \beta_1, \cdots, \beta_p, M}{\text{maximize}} \; M \tag{9.9}$$

    $$ \text{subject to} \; \sum_{j=1}^{p}{\beta_j^2} = 1, \tag{9.10}$$

    $$ y_i (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \forall \; i=1, \cdots, n \tag{9.11}$$

    This optimization problem is actually simpler than it looks. First of all, the constraint in $Eq.(9.11)$ guarantees that each observation will be on the correct side of the hyperplane, provided that $M$ is positive. Second, note that $Eq.(9.10)$ is not really a constraint on the hyperplane since if $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = 0$ defines a hyperplane, then so does $k(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) = 0$ for any $k \ne 0$. However, $Eq. (9.10)$ adds meaning to $Eq.(9.11)$; one can show that with this constraint the perpendicular distance from the $i$-th observation to the hyperplane is given by

    $$ y_i (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) $$

    Note the following two things to understand the above statement: 1) the equation used to calculate the perpendicular distance (refer to here), and 2) the effect of $M$ (the right-hand side) in $Eq.(9.11)$, which is well represented in the following figure:

    (Blue) $M=1$, (Red) $M=0$, (Green) $M=-1$

    Therefore, the constraints $Eq.(9.10)$ and $Eq.(9.11)$ ensure that each observation is on the correct side of the hyperplane and at least a distance $M$ from the hyperplane. Hence, $M$ represents the margin of our hyperplane, and the optimization problem chooses $\beta_0, \beta_1, \cdots, \beta_p$ to maximize $M$. This is exactly the definition of the maximal margin hyperplane. The optimization problem can be solved efficiently, but details of this optimization are outside of the scope of this book.
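
    As a hedged sketch (e1071::svm does not expose a hard-margin option directly), a linear kernel with a very large cost on separable data approximates the maximal margin classifier; the toy data below are simulated for illustration.

    ```r
    # Approximate maximal margin classifier: linear kernel, very large cost
    library(e1071)

    set.seed(1)
    x <- matrix(rnorm(40), ncol = 2)
    y <- rep(c(-1, 1), each = 10)
    x[y == 1, ] <- x[y == 1, ] + 3                  # shift one class well away from the other
    dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

    mmc_fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
    mmc_fit$index                                   # indices of the support vectors
    plot(mmc_fit, dat)                              # decision boundary and support vectors
    ```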

    9.1.5 The Non-separable Case

    The maximal margin classifier is a very natural way to perform classification if a separating hyperplane exists. However, in many cases, no separating hyperplane exists (at around a classification boundary, classes could be somewhat mixed), so the maximal margin classifier cannot be used. In this case, the optimization problem $Eqs. (9.9)-(9.11)$ has no solution with $M > 0$. An example is shown in the following figure:

    There are two classes of observations. The two classes are not separable by a hyperplane, so the maximal margin classifier cannot be used.

    In the next section, we will see that we can extend the concept of a separating hyperplane in order to develop a hyperplane that almost separates the classes using a so-called soft margin. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.

    9.2 Support Vector Classifiers

    9.2.1 Overview of the Support Vector Classifier

    A classifier based on a separating hyperplane will necessarily perfectly classify all of the training observations; this can lead to sensitivity to individual observations. An example is shown below:

    The resulting maximal margin hyperplane is not satisfactory - it has only a tiny margin and the separating hyperplane does not generalize well (i.e., it is not robust). The fact that the maximal margin hyperplane is extremely sensitive to a change in a single observation suggests that it may have overfitted the training data.

    In this case, we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of 1) Greater robustness to individual observations, 2) Better classification of most of the training observations. The support vector classifier, sometimes called a soft margin classifier, does exactly this.

    9.2.2 Details of the Support Vector Classifier

    The support vector classifier classifies a test observation depending on which side of a hyperplane it lies on. The hyperplane is chosen to correctly separate most of the training observations into the two classes, but it may misclassify a few observations. It is the solution to the following optimization problem:

    $$ \underset{\beta_0, \cdots, \beta_p, \epsilon_1, \cdots, \epsilon_n, M}{\text{maximize}} \; M \tag{9.12}$$

    $$ \text{subject to} \; \sum_{j=1}^{p}{\beta_j^2} = 1 \tag{9.13} $$

    $$ y_i (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i) \tag{9.14} $$

    $$ \epsilon_i \ge 0, \quad \sum_{i=1}^{n}{\epsilon_i} \le C \tag{9.15} $$

    where $C$ is a non-negative tuning parameter. As in $Eq.(9.11)$, $M$ is the width of the margin. In $Eq.(9.14)$, $\epsilon_1, \cdots, \epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane. Once the optimization problem is solved, we can classify a test observation based on the sign of $f(x^*) = \beta_0 + \beta_1 x_{1}^* + \cdots + \beta_p x_{p}^*$. As for $\epsilon_i$: if $\epsilon_i = 0$, then the $i$-th observation is on the correct side of the margin. If $\epsilon_i > 0$, then the $i$-th observation is on the wrong side of the margin, and we say that the $i$-th observation has violated the margin. If $\epsilon_i > 1$, then it is on the wrong side of the hyperplane.

    As for $C$ in $Eq.(9.15)$, $C$ determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate. If $C = 0$, then the optimization problem here becomes the same as that of the maximal margin classifier. In practice, $C$ is treated as a tuning parameter.

    The optimization problem $Eqs.(9.12)-(9.15)$ has a very interesting property: it turns out that only observations that either lie on the margin or violate the margin affect the hyperplane. In other words, observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; only these observations affect the support vector classifier.

    The fact that only support vectors affect the classifier is in line with our previous assertion that $C$ controls the bias-variance trade-off of the support vector classifier. If $C$ is large, the margin is wide, so many observations violate the margin, and there are many support vectors. In contrast, if $C$ is small, there will be fewer support vectors due to the narrow margin.

    Top Left $\rightarrow$ Top Right $\rightarrow$ Bottom Left $\rightarrow$ Bottom Right: in descending order of $C$
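
    A hedged sketch of the support vector classifier with e1071::svm (simulated, overlapping classes). Note that e1071's cost argument penalizes margin violations, so it plays the opposite role of the budget $C$ in $Eq.(9.15)$: a small cost gives a wide margin and many support vectors.

    ```r
    # Support vector classifier (linear kernel) plus cross-validation over cost
    library(e1071)

    set.seed(1)
    x <- matrix(rnorm(40), ncol = 2)
    y <- rep(c(-1, 1), each = 10)
    x[y == 1, ] <- x[y == 1, ] + 1                  # classes overlap: no separating hyperplane
    dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = as.factor(y))

    svc_fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
    summary(svc_fit)                                # reports the number of support vectors

    # Treat cost as a tuning parameter and choose it by cross-validation
    tuned <- tune(svm, y ~ ., data = dat, kernel = "linear",
                  ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))
    summary(tuned)
    ```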

    9.3 Support Vector Machines

    We first discuss a general mechanism for converting a linear classifier into one that produces non-linear decision boundaries. We then introduce the support vector machine, which does this in an automatic way.

    9.3.1 Classification with Non-linear Decision Boundaries

    If the boundary between the two classes is linear, then the linear support vector classifier can work without a problem. However, in practice, we are sometimes faced with non-linear class boundaries. See the following figure:

    Right: The linear support vector classifier performs very poorly

    Previously, when a linear regression model suffered from a non-linear relationship between the predictors and the outcome (response), we enlarged the feature space using functions of the predictors such as quadratic and cubic terms to address the non-linearity. In the case of the support vector classifier, we could address this problem in a similar fashion by enlarging the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors. For instance, rather than fitting a support vector classifier using $p$ features

    $$ X_1, X_2, \cdots, X_p $$

    we could instead fit a support vector classifier using $2p$ features

    $$ X_1, X_1^2, X_2, X_2^2, \cdots, X_p, X_p^2 $$

    Then $Eqs.(9.12)-(9.15)$ would become:

    $$ \underset{\beta_0, \beta_{11}, \beta_{12}, \cdots, \beta_{p1}, \beta_{p2}, \epsilon_1, \cdots, \epsilon_n, M}{\text{maximize}} \; M $$

    $$ \text{subject to} \;\; y_i \left( \beta_0 + \sum_{j=1}^{p}{\beta_{j1} x_{ij}} + \sum_{j=1}^{p}{\beta_{j2}x_{ij}^2} \right) \ge M(1 - \epsilon_i), $$

    $$ \sum_{i=1}^n{\epsilon_i} \le C, \quad \epsilon_i \ge 0, \quad \sum_{j=1}^p \sum_{k=1}^2 \beta_{jk}^2 = 1 $$

    Why does this lead to a non-linear decision boundary? In the original feature space, the decision boundary is of the form $q(x)=0$, where $q$ is a quadratic polynomial, and its solutions are generally non-linear. Even higher-order polynomial terms or interaction terms can be used. There are many possible ways to enlarge the feature space, and that could end up with a huge number of features. Then, computations would become unmanageable. The support vector machine, which we present next, allows us to enlarge the feature space used by the support vector classifier in a way that leads to efficient computations.
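
    A hedged illustration of enlarging the feature space by hand (simulated data with a circular class boundary): a linear support vector classifier on $(X_1, X_1^2, X_2, X_2^2)$ yields a quadratic boundary in the original $(X_1, X_2)$ space.

    ```r
    # Linear SVC on the original features vs. on features augmented with squared terms
    library(e1071)

    set.seed(1)
    x1 <- rnorm(200); x2 <- rnorm(200)
    y  <- as.factor(ifelse(x1^2 + x2^2 > 1.5, 1, -1))   # a non-linear (circular) boundary
    dat <- data.frame(x1, x2, y)

    linear_fit   <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 1)
    enlarged_fit <- svm(y ~ x1 + I(x1^2) + x2 + I(x2^2), data = dat,
                        kernel = "linear", cost = 1)
    mean(predict(linear_fit, dat)   == dat$y)    # training accuracy, original features
    mean(predict(enlarged_fit, dat) == dat$y)    # typically much higher with squared terms
    ```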

    9.3.2 The Support Vector Machine

    The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way using kernels. The kernel approach that we describe here is simply an efficient computational approach to enlarge our feature space to accommodate a non-linear boundary between classes.

    We have not discussed exactly how the support vector classifier is computed because the details become somewhat technical. However, it turns out that the solution to the linear support vector classifier problem $Eqs.(9.12)-(9.15)$ involves only the inner products of the observations. The inner product of two observations $x_i$, $x_{i^\prime}$ is given by

    $$ \langle x_i, x_{i^\prime} \rangle = \sum_{j=1}^{p}{x_{ij} x_{i^\prime j}} \tag{9.17}$$

    It can be shown that

    - The linear support vector classifier can be represented as 

    $$ f(x) = \beta_0 + \sum_{i=1}^{n}{ \alpha_i \langle x, x_i \rangle } \tag{9.18}$$

    where there are $n$ parameters $\alpha_i; \: i=1, \cdots, n,$ one per training observation.

    - To estimate the parameters $\alpha_1, \cdots, \alpha_n$ and $\beta_0$, all we need are the $n(n-1)/2$ inner products $\langle x_i, x_{i^\prime} \rangle$ between all pairs of training observations.

    Notice that in $Eq.(9.18)$, in order to evaluate the function $f(x)$, we need to compute the inner product between the new point $x$ and each of the training points $x_i$. However, it turns out that $\alpha_i$ is nonzero only for the support vectors in the solution - that is, if a training observation is not a support vector, then its $\alpha_i$ equals zero. So if $\mathcal{S}$ is the collection of indices of these support points, we can rewrite any solution function of the form $Eq.(9.18)$ as

    $$ f(x) = \beta_0 + \sum_{i \in \mathcal{S}}{ \alpha_i \langle x, x_i \rangle } \tag{9.19} $$

    which typically involves far fewer terms than in $Eq.(9.18)$.

    To summarize, in representing the linear support vector classifier $f(x)$, and in computing its coefficients, all we need are inner products.

    Now suppose that every time the inner product $Eq.(9.17)$ appears in the representation $Eq.(9.18)$, or in a calculation of the solution for the support vector classifier, we replace it with a generalization of the inner product of the form

    $$ K(x_i, x_{i^\prime}) \tag{9.20} $$

    where $K$ is some function that we will refer to as a kernel. A kernel is a function that quantifies the similarity of two observations. Note that the term "kernel" as used in convolutional neural networks is unrelated to the kernel functions used here (and in math generally). For a simple kernel, we could simply take

    $$ K(x_i, x_{i^\prime}) = \sum_{j=1}^p{ x_{ij} x_{i^\prime j} } \tag{9.21} $$

    which would just give us back the linear support vector classifier. Recall that the inner product can be viewed as a measure of similarity between two vectors, much like covariance and correlation. $Eq.(9.21)$ is known as a linear kernel because the support vector classifier is linear in the features; the linear kernel essentially quantifies the similarity of a pair of observations using Pearson (standard) correlation. But one could instead choose another form for the kernel function. For instance, one could replace every instance of $\sum_{j=1}^p{x_{ij} x_{i^\prime j}}$ with the quantity

    $$ K(x_i, x_{i^\prime}) = (1 + \sum_{j=1}^p{x_{ij} x_{i^\prime j}})^d \tag{9.22} $$

    This is known as a polynomial kernel of degree $d$. It essentially amounts to fitting a support vector classifier in a higher-dimensional space involving polynomials of degree $d$. When the support vector classifier is combined with a non-linear kernel such as $Eq.(9.22)$, the resulting classifier is known as a support vector machine. The SVM can be represented as

    $$ f(x) = \beta_0 + \sum_{i \in \mathcal{S}}{ \alpha_i K(x, x_i) } \tag{9.23}$$

    There is another popular kernel function, called the radial kernel, which takes the form

    $$ K(x_i, x_{i^\prime}) = \exp\left(-\gamma \sum_{j=1}^p (x_{ij} - x_{i^\prime j})^2\right) \tag{9.24}$$

    where $\gamma$ is a positive constant. Given that $Eq.(9.24)$ has the form $y = \exp(-x)$: if $x_i$ and $x_{i^\prime}$ are close to each other, then $K(x_i, x_{i^\prime})$ is close to 1; if they are far apart, $K(x_i, x_{i^\prime})$ is close to 0.

    Left: SVM with a polynomial kernel of degree 3. Right: SVM with a radial kernel
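
    A hedged sketch of fitting SVMs with the polynomial and radial kernels via e1071::svm, reusing the simulated circular-boundary data idea from above; in practice gamma, degree, and cost would be chosen by cross-validation (e.g., with tune()).

    ```r
    # SVMs with non-linear kernels on simulated data
    library(e1071)

    set.seed(1)
    x1 <- rnorm(200); x2 <- rnorm(200)
    y  <- as.factor(ifelse(x1^2 + x2^2 > 1.5, 1, -1))
    dat <- data.frame(x1, x2, y)

    poly_fit   <- svm(y ~ ., data = dat, kernel = "polynomial", degree = 3, cost = 1)  # Eq. (9.22)
    radial_fit <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)       # Eq. (9.24)
    mean(predict(radial_fit, dat) == dat$y)   # training accuracy of the radial-kernel SVM
    ```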

    9.4 SVMs with More than Two Classes

    The two most popular approaches are the one-versus-one and one-versus-all approaches.

    Left: One-Versus-One approach. Right: One-Versus-All approach
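
    As a practical note, e1071::svm handles more than two classes with the one-versus-one approach internally (fitting all pairwise classifiers and combining them by voting); a minimal three-class example on the built-in iris data:

    ```r
    # Multiclass SVM: e1071/libsvm uses one-versus-one under the hood
    library(e1071)

    multi_fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
    table(predict(multi_fit, iris), iris$Species)   # training confusion matrix
    ```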

     
