  • [2021.03.15] Few-shot Learning; Self-supervised Learning
    Paper review 2021. 3. 15. 13:43

    S. Gidaris et al., 2018, "Dynamic few-shot visual learning without forgetting"

    This paper proposes a few-shot object recognition system that can dynamically learn novel categories from only a few training examples while, at the same time, not forgetting the base categories on which it was trained.

    To achieve this, the authors introduce two components:
    1) A ConvNet classifier that computes a cosine similarity between feature representations and classification weight vectors
    2) An attention-based few-shot classification weight generator.

    Cosine-similarity based ConvNet recognition model

    The standard setting for classification networks is, after extracting the feature vector $z$, to estimate the classification probability vector $p = C(z | W^*)$ by first computing the raw classification score $s_k$ of each category $k \in [1, K^*]$ with the dot-product operator $s_k = z^T w_k^*$, where $w_k^*$ is the $k$-th classification weight vector in $W^*$, and then applying the softmax operator across all $K^*$ classification scores, i.e., $p_k = \mathrm{softmax}_k(s_1, \dots, s_{K^*})$.

    In this paper, the classification weight vector $w_k^*$ can come either from the base categories, i.e., $w_k^* \in W_{base}$, or from the novel categories, i.e., $w_k^* \in W_{novel}$. However, the mechanisms involved in learning those classification weights are very different. Due to those differences, the weight values in the two cases can be completely different, and the same applies to the raw classification scores computed with the dot-product operation, which can thus have totally different magnitudes depending on whether they come from the base or novel categories. To overcome this issue, the authors propose to modify the classifier $C(. | W^*)$ and compute the raw classification scores using the cosine similarity operator:

    $$ s_k = \tau \cdot \mathrm{cos}(z, w_k^*) = \tau \cdot \bar{z}^\mathrm{T} \bar{w}_k^* $$

    where $\bar{z}= \frac{z}{|| z ||}$ and $\bar{w}_k^* = \frac{w_k^*}{|| w_k^* ||}$ are the $l_2$-normalized vectors, and $\tau$ is a learnable scalar value. Since the cosine similarity can be implemented by first $l_2$-normalizing the feature vector $z$ and the classification weight vector $w_k^*$ and then applying the dot-product operator, the absolute magnitudes of the classification weight vectors can no longer affect the value of the raw classification score.

    In addition to the above modification, the authors also choose to remove the ReLU non-linearity after the last hidden layer of the feature extractor, which allows the feature vector $z$ to take both positive and negative values.
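    A minimal PyTorch sketch (not the authors' code) of this cosine-similarity classifier head may make the modification concrete: both the feature vector and the classification weight vectors are $l_2$-normalized before the dot product, and a learnable scalar $\tau$ scales the scores before the softmax. All names and initializations below are my own assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CosineClassifier(nn.Module):
        def __init__(self, feat_dim, num_classes, init_tau=10.0):
            super().__init__()
            # One classification weight vector w_k per category (rows of this matrix).
            self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
            # Learnable scale tau.
            self.tau = nn.Parameter(torch.tensor(init_tau))

        def forward(self, z):                          # z: (batch, feat_dim)
            z_norm = F.normalize(z, dim=-1)            # z / ||z||
            w_norm = F.normalize(self.weight, dim=-1)  # w_k / ||w_k||
            scores = self.tau * z_norm @ w_norm.t()    # s_k = tau * cos(z, w_k)
            return scores                              # feed to softmax / cross-entropy

    # Contrast with the standard dot-product classifier, whose scores depend on
    # the magnitudes of z and w_k:
    #   scores = z @ self.weight.t()
    ```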

    The results from the paper show that the cosine-similarity-based classifier leads the feature extractor to learn features that generalize significantly better on novel categories than features learned with the dot-product-based classifier. A possible explanation for this is that, in order to minimize the classification loss of a cosine-similarity-based ConvNet model, the $l_2$-normalized feature vector of an image must be very closely matched with the $l_2$-normalized classification weight vector of its ground truth category. As a consequence, the feature extractor is forced to (a) learn to encode those discriminative visual cues on its feature activations which the classification weight vectors of the ground truth categories learn to look for, and (b) learn to generate $l_2$-normalized feature vectors with low intra-class variance, since all the feature vectors that belong to the same category must be very closely matched with the single classification weight vector of that category. Moreover, the cosine-similarity-based classification objective resembles the training objectives typically used by metric-learning approaches.

    Few-shot classification-weight generator based on attention

    The few-shot classification weight generator $G(. , . | \phi )$ receives the feature vectors $Z^{\prime} = \{z^{\prime}_i \}^{N^{\prime}}_{i=1}$ of the $N^{\prime}$ training examples of a novel category and (optionally) the classification weight vectors of the base categories $W_{base}$ as its input. Based on them, it infers a classification weight vector $w^{\prime} = G( Z^{\prime}, W_{base} | \phi )$ for that novel category. Here, we explain how the above few-shot classification weight generator is constructed:

    Feature averaging-based weight inference

    An obvious choice to infer the classification weight vector $w^{\prime}$ is to average the feature vectors of the training examples (after they have been $l_2$-normalized): $w^{\prime}_{avg} = \frac{1}{N^{\prime}} \sum_{i=1}^{N^{\prime}} \bar{z}_i^{\prime} $. The final classification weight vector, if we only use the feature-averaging mechanism, is $w^{\prime} = \phi_{avg} \odot w^{\prime}_{avg} $, where $\phi_{avg} \in \mathbb{R}^d$ is a learnable weight vector. A similar strategy has previously been proposed by the Prototypical Network. However, it does not fully exploit the learned knowledge, and the averaging cannot infer an accurate classification weight vector when there is only a single training example for the novel category.

    Attention-based weight inference

     The authors enhance the above feature averaging mechanism with an attention-based mechanism that composes novel classification weight vectors by "looking" at a memory that contains the base classification weight vectors $W_{base} = \{ w_b \}_{b=1}^{K_{base}}$. More specifically, an extra attention-based classification weight vector $w^{\prime}_{att}$ is computed as:

    $$ w_{att}^{\prime} = \frac{1}{N^{\prime}} \sum_{i=1}^{N^{\prime}} \sum_{b=1}^{K_{base}} Att(\phi_q \bar{z}_i^{\prime}, k_b) \cdot \bar{w}_b $$

    where $\phi_q \in \mathbb{R}^{d \times d}$ is a learnable weight matrix that transforms the feature vector $\bar{z}_i^{\prime}$ into a query vector used for querying the memory, $\{ k_b \in \mathbb{R}^d \}_{b=1}^{K_{base}}$ is a set of $K_{base}$ learnable keys used for indexing the memory, $Att(\cdot , \cdot)$ is an attention kernel implemented as a cosine similarity followed by a softmax over the $K_{base}$ base categories, and $N^{\prime}$ is the number of training examples of the novel category (typically, $N^{\prime} \le 5$). Note that the classification weight vector of a novel category can thus be composed as a linear combination of the base classification weight vectors that are most similar to the few training examples of that category. This allows the few-shot weight generator to explicitly exploit the acquired knowledge about the base categories.

    The final classification weight vector is computed as: $w^{\prime} = \phi_{avg} \odot w^{\prime}_{avg} + \phi_{att} \odot w^{\prime}_{att} $ where $\phi_{avg}, \phi_{att} \in \mathbb{R}^d$ are learnable weight vectors. 
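    The sketch below puts the averaging and attention terms together, following the definitions above. The function and argument names are my own, and the scaling factor `tau` inside the attention kernel is an assumption rather than the paper's exact parameterization.

    ```python
    import torch
    import torch.nn.functional as F

    def generate_novel_weight(z_novel, w_base, keys, phi_q, phi_avg, phi_att, tau=10.0):
        # z_novel: (N', d) support features   w_base: (K_base, d) base weights
        # keys: (K_base, d)   phi_q: (d, d)   phi_avg, phi_att: (d,)
        z_bar = F.normalize(z_novel, dim=-1)

        # Feature-averaging term: mean of the l2-normalized support features.
        w_avg = z_bar.mean(dim=0)                                  # (d,)

        # Attention term: each support feature queries the memory of base weights.
        queries = F.normalize(z_bar @ phi_q, dim=-1)               # (N', d)
        sims = queries @ F.normalize(keys, dim=-1).t()             # cosine sims, (N', K_base)
        att = F.softmax(tau * sims, dim=-1)                        # Att(phi_q z_i', k_b)
        w_att = (att @ F.normalize(w_base, dim=-1)).mean(dim=0)    # (d,)

        # Final classification weight vector for the novel category.
        return phi_avg * w_avg + phi_att * w_att
    ```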

    H. Qi et al., 2017, "Low-shot learning with imprinted weights"

    Weight imprinting is proposed in this paper. It directly sets the final-layer weights for new classes from the embeddings of training exemplars. Given a single training sample $x_+$ from a novel class, the method computes the embedding $\phi(x_+)$ and uses it to set a new column in the classifier's weight matrix for the new class, i.e., $w_+ = \phi(x_+)$.


    Model Architecture

    Different from standard ConvNet classifier architectures, the authors add an $L_2$ normalization layer at the end of the embedding extractor so that the output embedding has unit length, i.e. $|| \phi(x) ||_2 = 1$. 

    Average embedding

    If $n > 1$ examples $\{ x_+^{(i)} \}_{i=1}^n$ are available for a new class, new weights are computed by averaging the normalized embeddings, $\tilde{w}_+ = \frac{1}{n} \sum_{i=1}^n \phi (x_{+}^{(i)}) $, and re-normalizing the resulting vector to unit length, $w_+ = \frac{\tilde{w}_+}{ || \tilde{w}_+ || } $. In practice, the averaging can also be applied to embeddings computed from randomly augmented versions of the original low-shot training examples.
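    A small sketch of the imprinting step under these definitions (function and variable names are mine, not the paper's code): average the unit-length embeddings of the low-shot examples, re-normalize, and append the result as a new column of the classification weight matrix.

    ```python
    import torch
    import torch.nn.functional as F

    def imprint_weight(embed_fn, x_new, weight_matrix):
        # x_new: (n, ...) low-shot examples of one new class
        # weight_matrix: (d, num_old_classes), columns already unit-length
        with torch.no_grad():
            emb = F.normalize(embed_fn(x_new), dim=-1)    # ||phi(x)||_2 = 1 per example
            w_plus = F.normalize(emb.mean(dim=0), dim=0)  # average, then re-normalize
            return torch.cat([weight_matrix, w_plus.unsqueeze(1)], dim=1)
    ```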

    W. Chen et al., 2020, "A closer look at few-shot classification"

    This paper presents 1) a consistent comparative analysis of several representative few-shot classification algorithms, 2) a modified baseline, Baseline++, which replaces the standard linear classifier with a cosine-similarity-based one, and 3) a new experimental setting for evaluating the cross-domain generalization ability of few-shot classification algorithms.

    One of the key findings is that Baseline++, which is essentially the cosine-similarity classifier described above, performs very competitively despite being a much simpler model than prior state-of-the-art methods such as MatchingNet, ProtoNet, MAML, and RelationNet.

    The paper also gives an overview of few-shot classification algorithms:

    Overview of Few-shot Classification Algorithms

    Baseline

    Training stage: We train a feature extractor $f_{\theta}$ (parametrized by the network parameters $\theta$) and the classifier $C(\cdot | \boldsymbol{W}_b)$ (parametrized by the weight matrix $\boldsymbol{W}_b \in \mathbb{R}^{d \times c}$) from scratch by minimizing a standard cross-entropy classification loss $L_{pred}$ using the training examples in the base classes $\boldsymbol{x}_i \in \boldsymbol{X}_b$. Here, we denote the dimension of the encoded feature as $d$ and the number of output classes as $c$. The classifier $C(\cdot | \boldsymbol{W}_b)$ consists of a linear layer $\boldsymbol{W}_b^T f_{\theta}(\boldsymbol{x}_i)$ followed by a softmax function $\sigma$.

    Fine-tuning stage: To adapt the model to recognize novel classes in the fine-tuning stage, we fix the pre-trained network parameters $\theta$ in the feature extractor $f_{\theta}$ and train a new classifier $C(\cdot | \boldsymbol{W}_n)$ (parametrized by the weight matrix $\boldsymbol{W}_n$) by minimizing $L_{pred}$ using the few labeled examples (i.e., the support set) in the novel classes $\boldsymbol{X}_n$.
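    A hedged sketch of the fine-tuning stage, assuming a pre-trained `f_theta` and a labeled support set: the feature extractor is frozen and only a new linear classifier over the novel classes is trained with cross-entropy. Hyperparameters and names below are illustrative assumptions, not the paper's settings.

    ```python
    import torch
    import torch.nn as nn

    def finetune_baseline(f_theta, support_x, support_y, num_novel, feat_dim,
                          steps=100, lr=0.01):
        f_theta.eval()                               # freeze theta
        for p in f_theta.parameters():
            p.requires_grad_(False)

        classifier = nn.Linear(feat_dim, num_novel)  # new W_n for the novel classes
        opt = torch.optim.SGD(classifier.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()

        with torch.no_grad():
            feats = f_theta(support_x)               # features are fixed, compute once

        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(classifier(feats), support_y)
            loss.backward()
            opt.step()
        return classifier
    ```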

    Baseline++

    Baseline++ replaces the linear classifier with a cosine-distance-based classifier between the input feature and the per-class weight vectors, which explicitly reduces intra-class variation among features during training.

    Meta-learning Algorithms

    Relation between few-shot learning and meta-learning: meta-learning algorithms are commonly used to perform few-shot classification. In that sense, few-shot learning is the higher-level problem, and meta-learning is one family of approaches for solving it.

    In this paper, MatchingNet, ProtoNet, RelationNet, and MAML are considered.

    The meta-learning process (illustrated with a figure in the original post) follows the standard episodic scheme: in the meta-training stage, the model is trained on support/query episodes sampled from the base classes, and in the meta-testing stage it is conditioned on the support set of the novel classes and evaluated on their query examples.
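    Since the figure is not reproduced here, the sketch below illustrates one such episode in a ProtoNet-style formulation: an N-way K-shot support set defines one prototype per class, and query examples are classified by their distance to the prototypes. The names and the choice of ProtoNet as the example are my own.

    ```python
    import torch
    import torch.nn.functional as F

    def episode_loss(f_theta, support_x, support_y, query_x, query_y, n_way):
        # support_x: (n_way * k_shot, ...)   query_x: (n_query, ...)
        s_feat = f_theta(support_x)                        # (n_way * k_shot, d)
        q_feat = f_theta(query_x)                          # (n_query, d)

        # One prototype per class: mean of that class's support features.
        protos = torch.stack([s_feat[support_y == c].mean(dim=0) for c in range(n_way)])

        # Negative squared Euclidean distance as the classification score.
        logits = -torch.cdist(q_feat, protos) ** 2         # (n_query, n_way)
        return F.cross_entropy(logits, query_y)
    ```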

    Experimental Results

    Experimental Setup

    Two scenarios are evaluated: standard few-shot classification on mini-ImageNet and CUB, and a cross-domain scenario (mini-ImageNet → CUB).

    Results

     

    Comments