Naive Bayes Explained
Theory
Before we get started, here is the notation used in this article:
$X=(X_1, X_2, ..., X_k)$ represents $k$ features. $Y$ is the label with $K$ possible values (classes).
From a probabilistic perspective, $X_i \in X$ and $Y$ are random variables.
The value of $X_i$ is written $x_i$, and that of $Y$ is $y$.
Basic Idea
To make classifications, we need to use $X$ to predict $Y$. In other words, given a data point $X=(x_1, x_2, ..., x_k)$, what is the probability of $Y$ being $y$? This can be written as the following conditional probability, and the classification is the value of $y$ that maximizes it:
$$ P(Y=y | X=(x_1,x_2, ..., x_k)) $$
$$ \text{classification} = \arg\max_y P(Y=y|X=(x_1, x_2, ..., x_k)) $$
This is the basic idea of Naive Bayes; the rest of the algorithm really focuses on how to calculate the conditional probability above.
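As a tiny sketch of this decision rule, the classifier simply picks the class with the largest conditional probability. The posterior values below are hypothetical and hard-coded purely for illustration:

```python
# Hypothetical posteriors P(Y=y | X=x) for a single data point,
# hard-coded only to illustrate the argmax decision rule.
posteriors = {"yes": 0.72, "no": 0.28}

# Naive Bayes classification: pick the label with the largest posterior.
prediction = max(posteriors, key=posteriors.get)
print(prediction)  # -> "yes"
```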
Bayes' Theorem
Let's take a look at an example training dataset with four features (Outlook, Temperature, Humidity, and Windy) and a label (Play).
We can formulate the classification for 'Play' given 'Outlook, Temperature, Humidity, and Windy' as follows:
$$ P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)} $$
$$ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} $$
where $X = \{ X_1, X_2, X_3, X_4 \} = \{$ Outlook, Temperature, Humidity, Windy $\}$, and $Y=y=$ Play. Then, it can be reformulated as:
$$ P(Y|X) $$
$$ = P(Y=y | X_1=x_1, X_2=x_2, X_3=x_3, X_4=x_4)$$
$$ \propto P(X_1=x_1, X_2=x_2, X_3=x_3, X_4=x_4 | Y=y) P(Y=y) $$
The evidence $P(X)$ in the denominator is the same for every class, so it can be dropped when we only care about which $y$ maximizes the posterior; only the likelihood and the prior matter for the argmax.
Here, $P(X_1=x_1, X_2=x_2, X_3=x_3, X_4=x_4 | Y=y)$ means the probability of $(X_1=x_1)$ and $(X_2=x_2)$ and $(X_3=x_3)$ and $(X_4=x_4)$ given $(Y=y)$, where the 'and' is interpreted as intersection $\cap$. If we assume that $X_1$, $X_2$, $X_3$, and $X_4$ are conditionally independent of each other given $Y$, we can write:
$$ P(X_1=x_1, X_2=x_2, X_3=x_3, X_4=x_4 | Y=y) $$
$$ = P(X_1=x_1 | Y=y) P(X_2=x_2 | Y=y) P(X_3=x_3 | Y=y) P(X_4=x_4 | Y=y) $$
With the simplified formula, the calculation becomes much easier. This conditional-independence assumption is what makes the method "naive". The general form of Naive Bayes is written as:
$$ P(Y|X) = \frac{P(Y) \prod_i P(X_i | Y)}{P(X)} $$
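To make the product formula concrete, here is a minimal sketch of a categorical Naive Bayes score computed from count tables. The toy data below (two features, Outlook and Windy, and the label Play) is hypothetical and only meant to show how the prior and the per-feature likelihoods are multiplied; it is not the dataset from the original article.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical values, for illustration only):
# two categorical features, Outlook and Windy, and the label Play.
data = [
    ({"Outlook": "sunny",    "Windy": "false"}, "no"),
    ({"Outlook": "sunny",    "Windy": "true"},  "no"),
    ({"Outlook": "overcast", "Windy": "false"}, "yes"),
    ({"Outlook": "rainy",    "Windy": "false"}, "yes"),
    ({"Outlook": "rainy",    "Windy": "true"},  "no"),
    ({"Outlook": "overcast", "Windy": "true"},  "yes"),
]

# Count class frequencies and, per class, feature-value frequencies.
class_counts = Counter(y for _, y in data)
feature_counts = defaultdict(Counter)   # key: (class, feature) -> value counts
for x, y in data:
    for feat, val in x.items():
        feature_counts[(y, feat)][val] += 1

def score(x, y):
    """Unnormalized posterior: P(Y=y) * prod_i P(X_i=x_i | Y=y)."""
    prob = class_counts[y] / len(data)                             # prior P(Y=y)
    for feat, val in x.items():
        prob *= feature_counts[(y, feat)][val] / class_counts[y]   # likelihood P(X_i=x_i | Y=y)
    return prob

query = {"Outlook": "sunny", "Windy": "true"}
print(max(class_counts, key=lambda y: score(query, y)))  # -> "no"
```

Note that in this toy sketch an unseen feature value zeroes out the whole product; real implementations usually add a smoothing term (e.g., Laplace smoothing) to the counts to avoid this.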
Continuous Data
In this form, Naive Bayes works only with categorical predictors (e.g., spam classification, where the presence or absence of words, phrases, characters, and so on lies at the heart of the predictive task). For continuous features, there are essentially two choices: discretization and continuous Naive Bayes.
Discretization works by breaking the continuous data into categorical values. The simplest discretization is uniform binning, which creates bins of fixed width. There are, of course, smarter and more complicated methods such as recursive minimal-entropy partitioning or SOM-based partitioning.
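A rough sketch of uniform binning, assuming NumPy is available (the feature values and the number of bins are arbitrary illustrations):

```python
import numpy as np

# Illustrative continuous feature values (e.g., temperatures).
values = np.array([18.0, 21.5, 23.0, 26.4, 30.1, 33.7])

# Uniform binning: split the observed range into a fixed number of
# equal-width bins and replace each value by its bin index.
n_bins = 3
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bin_ids = np.digitize(values, edges[1:-1])  # indices 0 .. n_bins-1

print(edges)    # bin boundaries
print(bin_ids)  # categorical codes usable by a categorical Naive Bayes
```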
The continuous Naive Bayes utilizes known distributions such as a normal distribution. The continuous Naive Bayes can be written as:
$$ P(Y|X) = \frac{P(Y) \prod_i f(X_i | Y)}{P(X)} $$
$f$ is the probability density function. If a feature $X_i$ looks approximately bell-shaped, it is fair to assume that it is normally distributed and to use the Gaussian density for $f$.
The first step is estimating the mean and variance of the feature for a given label $y$. Let $S$ be the set of values of $X_i$ observed among the data points with $Y=y$.
$$ \mu' = \frac{1}{|S|} \sum_{x \in S} x $$
$$ {\sigma'}^2 = \frac{1}{|S|-1} \sum_{x \in S} (x - \mu')^2 $$
Now we can calculate the probability density $f(x)$:
$$ f(X_i=x | Y=y) = \frac{1}{\sqrt{2 \pi {\sigma'}^2}} e^{-\frac{(x-\mu')^2}{2 {\sigma'}^2}} $$
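Putting the two estimation steps and the density together, here is a minimal sketch of the Gaussian likelihood term for one feature and one class (the sample values are hypothetical):

```python
import math

# Hypothetical values of one continuous feature X_i among the
# training points with label Y = y.
S = [20.1, 22.4, 19.8, 23.0, 21.7]

# Estimate the class-conditional mean and (unbiased) variance.
mu = sum(S) / len(S)
var = sum((x - mu) ** 2 for x in S) / (len(S) - 1)

def gaussian_density(x, mu, var):
    """Normal pdf f(X_i = x | Y = y) with the estimated mu and sigma^2."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Density of a new observation under this class's distribution;
# it replaces P(X_i = x | Y = y) in the Naive Bayes product.
print(gaussian_density(21.0, mu, var))
```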
Source: towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0