Noise Contrastive Estimation and Negative Sampling
Reference: [C. Dyer, 2014, "Notes on Noise Contrastive Estimation and Negative Sampling"]; Check out my Mendeley.
Estimating the parameters of probabilistic models of language, such as probabilistic neural models, is computationally difficult because it involves evaluating partition functions by summing over an entire vocabulary. Two closely related strategies, noise contrastive estimation and negative sampling, have emerged as popular solutions to this computational problem.
1. Introduction
Let us assume the following model of language, which predicts a word $w$ in a vocabulary $V$ based on some given context $c$ (by language model I mean a model that generates one word at a time, conditional on any other ambient context such as previously generated or surrounding words):
$$ p_{\theta}(w|c) = \frac{ u_{\theta}(w,c) }{ \sum_{w' \in V} u_{\theta}(w',c) } = \frac{ u_{\theta}(w,c) }{ Z(c) } \tag{1} $$
This kinda looks like a softmax: $u_{\theta}(w,c) = \mathrm{exp}(s_{\theta}(w, c))$ assigns a score to a word in context, $Z(c)$ is the partition function that normalizes this into a probability distribution, and $s_{\theta}(w,c)$ is differentiable with respect to $\theta$. The standard learning procedure is to maximize the likelihood of a sample of training data. Unfortunately, computing this probability (and its derivatives) is expensive, since it requires summing over all words in $V$, which is generally very large.
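As a concrete illustration, here is a minimal PyTorch sketch of why the exact likelihood is expensive; the bilinear scorer, embedding sizes, and ids are illustrative assumptions, not the model from the note:

```python
import torch

# Minimal sketch: computing p_theta(w|c) exactly requires the partition
# function Z(c), i.e. a sum (here a logsumexp) over the whole vocabulary V.
V, dim = 100_000, 128                       # vocabulary and embedding sizes (illustrative)
ctx_emb = torch.nn.Embedding(V, dim)        # context representations (hypothetical scorer)
out_emb = torch.nn.Embedding(V, dim)        # output-word representations

def log_p_theta(w, c):
    """Exact log p_theta(w|c) under a bilinear score s_theta(w,c) = out_w . ctx_c."""
    scores = out_emb.weight @ ctx_emb(c)    # s_theta(w', c) for every w' in V -> shape (V,)
    log_Z = torch.logsumexp(scores, dim=0)  # log Z(c): O(|V|) work per context
    return scores[w] - log_Z

print(log_p_theta(torch.tensor(42), torch.tensor(7)).item())
```

Every call pays an $O(|V|)$ cost for the `logsumexp`; that summation is exactly what the rest of this post tries to avoid.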
To solve this problem, classical strategies such as importance sampling and related Monte Carlo techniques have been used. Noise contrastive estimation and negative sampling represent an evolution of these techniques. They work by transforming the computationally expensive learning problem into a binary classification proxy problem that uses the same parameters but requires statistics that are easier to compute.
1.1 Empirical distributions, Noise distributions, and Model distributions
I will refer to $\hat{p}(w|c)$ and $\hat{p}(c)$ as empirical distributions. Our task is to find the parameters $\theta$ of a model $p_{\theta}(w|c)$ that approximates the empirical distribution as closely as possible, in terms of minimal cross entropy. To avoid costly summations, a "noise" distribution, $q(w)$, is used; in practice, $q$ is something cheap to sample from, such as a uniform distribution or the empirical unigram distribution.
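As an illustration, here is a minimal sketch of building an empirical unigram noise distribution $q(w)$ and sampling from it; the toy corpus and word-id mapping are made up for the example:

```python
import torch
from collections import Counter

# Minimal sketch: an empirical unigram noise distribution q(w) over a toy corpus.
corpus = "the cat sat on the mat the cat slept".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

counts = Counter(word2id[w] for w in corpus)
q = torch.tensor([counts[i] for i in range(len(vocab))], dtype=torch.float)
q /= q.sum()                                            # q(w): unigram probabilities

k = 5
noise_ids = torch.multinomial(q, k, replacement=True)   # k noise samples from q
print([vocab[i] for i in noise_ids.tolist()])
```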
2. Noise Contrastive Estimation (NCE)
NCE reduces the language model estimation problem to the problem of estimating the parameters of a probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution from samples generated by the noise distribution. The two-class training data is generated as follows: sample a $c$ from $\hat{p}(c)$, then sample one "true" sample from $\hat{p}(w|c)$, with auxiliary label $D=1$ indicating the data point is drawn from the true distribution, and $k$ "noise" samples from $q$, with auxiliary label $D=0$ indicating these data points are noise (see the code sketch below). Thus, given $c$, the joint probability of $(d,w)$ in the two-class data has the form of a mixture of two distributions:
$$ p(d, w|c) = \begin{cases} \frac{k}{1+k} \times q(w) & \text{if } d = 0 \\ \frac{1}{1+k} \times \hat{p}(w|c) & \text{if } d = 1 \end{cases} $$
Using the definition of conditional probability, this can be turned into a conditional probability of $d$ having observed $w$ and $c$:
$$ p(d,w|c) = \frac{ p(d,w,c) }{ p(c) } = \frac{ p(d|w,c) p(w,c) }{ p(c) } = \frac{ p(d|w,c) p(w|c) p(c) }{ p(c) } = p(d|w,c) p(w|c) $$
Rearranging, $p(d|w,c) = p(d,w|c) / p(w|c)$, and dividing the mixture above by $p(w|c) = \frac{1}{1+k} \hat{p}(w|c) + \frac{k}{1+k} q(w)$ gives
$$ p(D=0|w,c) = \frac{ k \times q(w) }{ \hat{p}(w|c) + k \times q(w) }, \qquad p(D=1|w,c) = \frac{ \hat{p}(w|c) }{ \hat{p}(w|c) + k \times q(w) } $$
Note that these probabilities are written in terms of the empirical distribution.
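To make the data-generation recipe above concrete, here is a minimal sketch of building the two-class "proxy corpus"; the `(context, word)` id pairs and the noise distribution `q` are illustrative placeholders:

```python
import torch

# Minimal sketch: for each observed (c, w) pair, emit one true example (D=1)
# and k noise examples (D=0) with the noise words drawn from q.
def make_proxy_corpus(pairs, q, k):
    """pairs: list of (context_id, word_id) tuples; q: noise probabilities over V."""
    examples = []
    for c, w in pairs:
        examples.append((c, w, 1))                             # D = 1: true sample
        for wbar in torch.multinomial(q, k, replacement=True).tolist():
            examples.append((c, wbar, 0))                      # D = 0: noise sample
    return examples

# Illustrative usage: made-up ids and a uniform q over a 10-word vocabulary.
print(make_proxy_corpus([(3, 7), (1, 4)], q=torch.full((10,), 0.1), k=2))
```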
NCE replaces the empirical distribution $\hat{p}(w|c)$ with the model distribution $p_{\theta}(w|c)$, and $\theta$ is chosen to maximize the conditional likelihood of the "proxy corpus" created as described above. But thus far we have not solved any computational problem: $p_{\theta}(w|c)$ still requires evaluating the partition function. To avoid that expense, NCE makes two further assumptions. First, it proposes that the partition function value $Z(c)$ be estimated as a context-specific parameter $z_c$. Second, for neural networks with lots of parameters, it turns out that fixing $z_c = 1$ for all $c$ is effective. This latter assumption both reduces the number of parameters and encourages the model to have "self-normalized" outputs (i.e., $Z(c) \approx 1$). Making these assumptions, we can now write the conditional likelihood of being a noise sample or a true-distribution sample in terms of $\theta$ as
$$ p(D=0|w,c) = \frac{ k \times q(w) }{ u_{\theta}(w,c) + k \times q(w) }, \qquad p(D=1|w,c) = \frac{ u_{\theta}(w,c) }{ u_{\theta}(w,c) + k \times q(w) } $$
We now have a binary classification problem with parameters $\theta$ that can be trained to maximize the conditional log-likelihood of a dataset $\mathcal{D}$, with $k$ negative samples chosen per observation:
$$ \mathcal{L}_{NCE_k} = \sum_{(w,c) \in \mathcal{D}} \left( \log p(D=1|w,c) + k \, \mathbb{E}_{\bar{w} \sim q} \left[ \log p(D=0|\bar{w},c) \right] \right) $$
Note that the conditional likelihood being maximized is essentially $\prod_{d=0}^{1} p(D=d|w,c)$ over the proxy corpus: one $p(D=1|w,c)$ factor per true sample and one $p(D=0|\bar{w},c)$ factor per noise sample.
Unfortunately, the expectation in the second term of $\mathcal{L}_{NCE_k}$ is still a difficult summation: it is $k$ times the expected log probability of producing a negative label under the noise distribution, taken over all words in $V$ in a context $c$, so we still have a loop over the entire vocabulary. The final step is therefore to replace this expectation with its Monte Carlo approximation:
$$ \mathcal{L}_{NCE_k}^{MC} = \sum_{(w,c) \in \mathcal{D}} \left( \log p(D=1|w,c) + k \sum_{i=1,\, \bar{w}_i \sim q}^{k} \frac{1}{k} \log p(D=0|\bar{w}_i,c) \right) = \sum_{(w,c) \in \mathcal{D}} \left( \log p(D=1|w,c) + \sum_{i=1,\, \bar{w}_i \sim q}^{k} \log p(D=0|\bar{w}_i,c) \right) $$
In an implementation, the classifier in a neural network model is just a final linear layer without a softmax activation: it outputs the score $s_{\theta}(w, c)$, with $u_{\theta}(w,c) = \mathrm{exp}( s_{\theta}(w,c) )$. You can refer to this NCE implementation in PyTorch.
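For orientation, here is a minimal sketch (not the linked implementation) of the Monte Carlo NCE objective above as a PyTorch loss, assuming $z_c = 1$ so that $u_{\theta}(w,c) = \mathrm{exp}(s_{\theta}(w,c))$; the batch size, $k$, and uniform $q$ in the usage line are illustrative:

```python
import torch

def nce_loss(s_true, s_noise, q_true, q_noise, k):
    """Monte Carlo NCE loss with the self-normalization assumption z_c = 1.

    s_true:  (B,)   scores s_theta(w, c) for the true words
    s_noise: (B, k) scores s_theta(w_bar, c) for the k noise words per example
    q_true:  (B,)   q(w) for the true words
    q_noise: (B, k) q(w_bar) for the noise words
    """
    log_kq_true = torch.log(k * q_true)
    log_kq_noise = torch.log(k * q_noise)
    # log p(D=1|w,c) = s - log(exp(s) + k q(w)), computed stably with logaddexp
    log_p_true = s_true - torch.logaddexp(s_true, log_kq_true)
    # log p(D=0|w_bar,c) = log(k q(w_bar)) - log(exp(s) + k q(w_bar))
    log_p_noise = log_kq_noise - torch.logaddexp(s_noise, log_kq_noise)
    return -(log_p_true + log_p_noise.sum(dim=1)).mean()

# Illustrative usage: random scores for a batch of 32, k = 5, uniform q over 10,000 words.
B, k, V = 32, 5, 10_000
print(nce_loss(torch.randn(B), torch.randn(B, k),
               torch.full((B,), 1.0 / V), torch.full((B, k), 1.0 / V), k).item())
```

Minimizing this loss maximizes $\mathcal{L}_{NCE_k}^{MC}$ for the batch; the scorer can be any network whose last linear layer outputs $s_{\theta}(w,c)$.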
3. Negative Sampling
Negative sampling is a variation of NCE used by the popular word2vec tool. It likewise generates a proxy corpus and learns $\theta$ as a binary classification problem, but it defines the conditional probabilities given $(w,c)$ differently, replacing the $k \times q(w)$ term with 1:
$$ p(D=0|w,c) = \frac{ 1 }{ u_{\theta}(w,c) + 1 }, \qquad p(D=1|w,c) = \frac{ u_{\theta}(w,c) }{ u_{\theta}(w,c) + 1 } = \sigma\left( s_{\theta}(w,c) \right) $$
This objective can be understood as follows: it is equivalent to NCE when $k = |V|$ and $q$ is uniform (since then $k \times q(w) = 1$). Outside that case, the conditional probabilities of $D$ given $(w,c)$ are not consistent with the language model probabilities, and therefore the $\theta$ estimated using this objective will not optimize the likelihood of the language model in Eq. (1). Thus, while negative sampling may be appropriate for word representation learning, it does not have the same asymptotic consistency guarantees that NCE has.
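Written as a loss, the negative sampling objective reduces to the familiar log-sigmoid form; below is a minimal sketch, with the batch size, number of noise words, and random scores as illustrative placeholders:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the negative sampling loss: with k*q(w) replaced by 1,
# p(D=1|w,c) = sigmoid(s_theta(w,c)) and p(D=0|w,c) = sigmoid(-s_theta(w,c)).
def negative_sampling_loss(s_true, s_noise):
    """s_true: (B,) true-word scores; s_noise: (B, k) noise-word scores."""
    pos = F.logsigmoid(s_true)                # log p(D=1|w,c)
    neg = F.logsigmoid(-s_noise).sum(dim=1)   # sum over k of log p(D=0|w_bar,c)
    return -(pos + neg).mean()

# Illustrative usage: batch of 32 with 5 noise words each.
print(negative_sampling_loss(torch.randn(32), torch.randn(32, 5)).item())
```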