  • [2021.01.28] Unsupervised learning, Semi-Supervised Learning
    Paper review 2021. 1. 28. 19:09

    R. Caruana, 1997, "Multitask Learning"

    Multi-task learning (MTL)

    It is an inductive transfer mechanism whose principal goal is to improve generalization performance. The MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks. In effect, the training signals for the extra tasks serve as an inductive bias. 

    The standard methodology in machine learning is to learn one task at a time. 

    If the tasks can share what they learn, the learner may find it is easier to learn them together than in isolation. Thus, if we simultaneously train a net to recognize object outlines, shapes, edges, regions, subregions, textures, reflections, highlights, shadows, text, orientation, size, distance, etc., it may learn better to recognize complex objects in the real world. 

    Fig. Multitask Backpropagation of four tasks with the same inputs

    Z. Wu et al., 2018, "Unsupervised Feature Learning via Non-Parametric Instance Discrimination" 

    This paper attempts to learn good representations via unsupervised learning. It shows that a well-learned feature extractor is very useful to do other tasks such as object detection. 

    The goal is to learn an embedding function $\boldsymbol{v} = f_{\theta}(x)$ without supervision. $f_{\theta}$ is a deep neural network with parameters $\theta$, mapping image $x$ to feature $\boldsymbol{v}$. This embedding induces a metric over the image space, $d_{\theta}(x,y) = || f_{\theta}(x) - f_{\theta}(y) ||$ for instances $x$ and $y$. A good embedding should map visually similar images closer to each other. This paper's novel unsupervised feature learning approach is instance-level discrimination. The authors treat each image instance as a distinct class of its own and train a classifier to distinguish between individual instance classes.

    Non-Parametric Classifier

    Under the conventional parametric softmax formulation, for image $x$ with feature $\boldsymbol{v} = f_{\theta}(x)$, the probability of it being recognized as $i$-th example is:

    $$ P(i | \boldsymbol{v}) = \frac{ \mathrm{exp}( \boldsymbol{w}_i^T \boldsymbol{v} ) }{ \sum_{j=1}^n \mathrm{exp}( \boldsymbol{w}_j^T \boldsymbol{v} ) } \tag{1}$$

    where $\boldsymbol{w}_j$ is a weight vector for class $j$, and $\boldsymbol{w}_j^T \boldsymbol{v}$ measures how well $\boldsymbol{v}$ matches the $j$-th class i.e., instance.

    The authors propose a non-parametric variant of $Eq.(1)$ that replaces $\boldsymbol{w}_j^T \boldsymbol{v}$ with $\boldsymbol{v}_j^T \boldsymbol{v}$, and we enforce $|| \boldsymbol{v} || = 1$ via an L2-normalization layer. This replacement allows instance-level comparison instead of comparison with a class prototype, $\boldsymbol{w}$. Then, the probability $P(i | \boldsymbol{v})$ becomes:

    $$ P(i | \boldsymbol{v}) = \frac{ \mathrm{exp}( \boldsymbol{v}_i^T \boldsymbol{v} / \tau ) }{ \sum_{j=1}^n \mathrm{exp}( \boldsymbol{v}_j^T \boldsymbol{v} / \tau ) }  \tag{2}$$

    where $\tau$ is a temperature parameter that controls the concentration level of the distribution. $\tau$ is important for supervised feature learning, and is also necessary for tuning the concentration of $\boldsymbol{v}$ on the unit sphere.

    Fig. Effects of the temperature; The larger temperature ($\tau > 1$) smoothes out the concentration, and the lower temperature ($\tau < 1$) intensifies the concentration.

    The learning objective is then to maximize the joint probability $\prod_{i=1}^n P_{\theta}(i | f_{\theta}(x_i))$, or equivalently to minimize the negative log-likelihood (NLL) over the training set, as

    $$ J(\theta) = -\sum_{i=1}^n \mathrm{log} P(i | f_{\theta}(x_i)) $$
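
    A minimal PyTorch-style sketch of this non-parametric classifier and its NLL objective (the tensor names and toy shapes below are my own assumptions for illustration, not the paper's code):

    ```python
    import torch
    import torch.nn.functional as F

    def instance_nll(v, all_v, idx, tau=0.07):
        """Non-parametric softmax NLL (Eq. 2) for a batch of L2-normalized features.

        v     : (B, D) features of the current batch, assumed L2-normalized
        all_v : (N, D) features of all instances, assumed L2-normalized
        idx   : (B,) dataset indices of the batch images (their "instance classes")
        """
        logits = v @ all_v.t() / tau          # (B, N) similarity to every instance
        return F.cross_entropy(logits, idx)   # mean of -log P(i | v) over the batch

    # toy usage
    B, N, D = 4, 100, 128
    all_v = F.normalize(torch.randn(N, D), dim=1)
    v = F.normalize(torch.randn(B, D), dim=1)
    loss = instance_nll(v, all_v, torch.randint(0, N, (B,)))
    ```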

    Memory bank

    To compute the probability $P(i | \boldsymbol{v})$ in $Eq.(2)$, $\boldsymbol{v}_j$ for all the images are needed. Instead of exhaustively computing these representations every time, the authors maintain a feature memory bank $V$ for storing them. Let $V = \{ \boldsymbol{v}_j \}$ be the memory bank and $\mathrm{\boldsymbol{f}}_i = f_{\theta}(x_i)$ be the feature of $x_i$. During each learning iteration, the representation $\mathrm{\boldsymbol{f}}_i$ as well as the network parameters $\theta$ are optimized via SGD. Then, $\mathrm{\boldsymbol{f}}_i$ is written to $V$ at the corresponding instance entry, $\mathrm{\boldsymbol{f}}_i \rightarrow \boldsymbol{v}_i$.
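
    A sketch of the memory-bank write-back for one iteration (in the paper the new feature is blended into the old entry and re-normalized; the momentum value here is an assumption):

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def update_memory_bank(bank, f, idx, m=0.5):
        """bank: (N, D) memory bank V; f: (B, D) current features; idx: (B,) instance indices."""
        blended = m * bank[idx] + (1.0 - m) * f   # mix the new feature into the stored entry
        bank[idx] = F.normalize(blended, dim=1)   # keep the entries on the unit sphere
        return bank
    ```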

    Semi-supervised learning

    A common scenario that can benefit from unsupervised learning is when we have a large amount of data of which only a small fraction are labeled. A natural semi-supervised learning approach is to first learn from the big unlabeled data and then fine-tune the model on the small labeled data. Namely, the representation learning is conducted with the big unlabeled dataset and the fine-tuning is conducted with the small labeled data.

    Results

    The methods from 'Data-Init' to 'SplitBrain' refer to the pretext tasks; the linear evaluation is used with features from 'conv1' to 'conv5'. The result shows that the later layers and a deeper backbone capture better representations.

    A. Van Den Oord et al., 2019, "Representation Learning with Contrastive Predictive Coding" (CPC)

    One of the most common strategies for unsupervised learning has been to predict future (forecasting), missing, or contextual information: "predictive coding"

    In this paper, the authors propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which the authors call Contrastive Predictive Coding. The key insight of the proposed model is to learn such representations by predicting the future in latent space using powerful autoregressive models. This approach is able to learn useful representations, achieving strong performance on four distinct domains: speech, images, text, and reinforcement learning.

    In the figure, $g_{enc}, g_{ar}, z_t, c_t$ denote an encoder, an autoregressive model, a latent representation, and a context latent representation, respectively. Note that the prediction is made in the latent space. The reason behind this is that we do not necessarily have to reconstruct the data in the high-dimensional (original input) space, which would be much harder. For predictive coding, predicting the latent representation in the low-dimensional latent embedding space is easier for the model.
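
    A compact sketch of this setup (the encoder and autoregressive architectures, dimensions, and the per-step linear predictors below are assumptions for illustration; the paper uses strided convolutions and a GRU for audio):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyCPC(nn.Module):
        def __init__(self, in_dim=1, z_dim=64, c_dim=64, max_k=4):
            super().__init__()
            self.g_enc = nn.Conv1d(in_dim, z_dim, kernel_size=4, stride=2)          # g_enc: x -> z_t
            self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)                      # g_ar: z_{<=t} -> c_t
            self.W = nn.ModuleList(nn.Linear(c_dim, z_dim) for _ in range(max_k))   # one predictor per step k

        def forward(self, x, t, k):
            z = self.g_enc(x).transpose(1, 2)   # (B, T, z_dim) latent sequence
            c, _ = self.g_ar(z)                 # (B, T, c_dim) context sequence
            pred = self.W[k - 1](c[:, t])       # predict z_{t+k} from the context c_t
            target = z[:, t + k]                # the actual future latent
            logits = pred @ target.t()          # (B, B): other samples in the batch act as negatives
            return F.cross_entropy(logits, torch.arange(x.size(0)))  # InfoNCE

    # toy usage: loss for predicting 2 steps ahead of position t=5
    loss = TinyCPC()(torch.randn(8, 1, 64), t=5, k=2)
    ```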

    Results

    It shows that predicting the relatively far future is important for learning useful features, which leads to high classification performance, even though the prediction accuracy on the far future is not that high.

    Long H. Nguyen et al., 2019, "Self-boosted Time-series Forecasting with multi-task and Multi-view Learning"

    Traditional approaches usually require additional feature sets. However, adding more feature sets from different sources of data is not always feasible. In this paper, the authors propose a novel self-boosted mechanism in which the original univariate time series is decomposed into multivariate time series.

    Multi-view

    Having multiple features instead of one feature:

    The decomposed, closely related time series are fed to build a multi-task learning model, while the decomposed, loosely related time series group is utilized for multi-view learning. The main task is forecasting at the forecast horizon, and the auxiliary tasks are forecasting the related time series.

    Ensemble Empirical Mode Decomposition (EEMD) algorithm

    It decomposes an original one-dimensional (univariate) time series into multiple components as shown in the above figure (Fig. 2). These decomposed time series can be categorized into two groups: 1) Closely related group to the original time series, 2) Loosely related group to the original time series. The similarity is measured by the correlation coefficient.

    The EEMD decomposition transforms non-linear, non-stationary time series to stationary time series and can be useful for forecasting performance.
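
    A sketch of the decomposition-and-grouping step (assuming the PyEMD package for EEMD; the 0.5 correlation threshold is an arbitrary illustrative choice, not the paper's):

    ```python
    import numpy as np
    from PyEMD import EEMD  # pip install EMD-signal

    def decompose_and_group(series, threshold=0.5):
        """Decompose a univariate series into IMFs and split them by correlation with the original."""
        imfs = EEMD().eemd(series)                                     # (n_imfs, T)
        corrs = [abs(np.corrcoef(imf, series)[0, 1]) for imf in imfs]
        closely = [imf for imf, c in zip(imfs, corrs) if c >= threshold]
        loosely = [imf for imf, c in zip(imfs, corrs) if c < threshold]
        return np.array(closely), np.array(loosely)
    ```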

    Intrinsic mode functions (IMF)

    It is a time-varying function in which the number of extrema and the number of zero crossings are equal or differ by at most one, and whose envelopes are symmetric with respect to zero.

    Results

    Three datasets are used: 1) Electricity, 2) Exchange rate, 3) Air temperature. All models forecast a one-time step forward.


    M. Caron et al., 2018, "Deep clustering for unsupervised learning of visual features"

    "DeepCluster" is proposed. It iterates between clustering with k-means the features produced by the ConvNet and updating its weights by predicting the cluster assignments as pseudo-labels in a discriminative loss.

    For clustering, the k-means is used.

    Related Work

    A popular form of unsupervised learning, called "self-supervised learning" uses the pretext tasks (up until sometime in 2018). The pretext task-approaches are domain-dependent, while the proposed algorithm in this paper is not.

    Weight Optimization

    Weights of a feature extractor and a linear classifier are learned by optimizing the following equation:

    $$ \mathrm{min}_{\theta, W} \frac{1}{N} \sum_{n=1}^N \ell \left( g_W \left( f_{\theta}(x_n) \right), y_n \right)  \tag{1} $$

    where $\theta$, $W$ are the weights of the feature extractor and linear classifier, and $x_n, y_n$ denote each image and its corresponding label in $\{ 0, 1 \}^k$, respectively. This label represents the image's membership to one of $k$ possible predefined classes. $f, g$ denote the feature extractor and linear classifier, respectively. $\ell$ denotes the multinomial logistic loss, also known as the negative log-softmax function.

    Unsupervised Learning by Clustering

    The authors cluster the output of the convnet and use the subsequent cluster assignments as "pseudo-labels" to optimize $Eq.(1)$. This deep clustering (DeepCluster) approach iteratively learns the features and groups them.

    The $k$-means takes a set of vectors as input, in our case the features $f_{\theta}(x_n)$ produced by the convnet, and clusters them into $k$ distinct groups based on a geometric criterion. More precisely, it jointly learns a $d \times k$ centroid matrix $C$ and the cluster assignments $y_n$ of each image $n$ by solving the following problem:

    Solving this problem provides a set of optimal assignments $(y_n^*)_{n \leq N}$ and a centroid matrix $C^*$. These assignments are then used as pseudo-labels.

    Overall, DeepCluster alternates between clustering the features to produce pseudo-labels using $Eq.(2)$ and updating the parameters of the convnet by predicting these pseudo-labels using $Eq.(1)$.
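
    A high-level sketch of one DeepCluster epoch (the k-means call and data-loading details are stand-ins; the paper additionally applies PCA/whitening to the features before clustering):

    ```python
    import torch
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    def deepcluster_epoch(convnet, classifier, loader, optimizer, k=1000):
        """Assumes `loader` iterates in a fixed order (shuffle=False) so pseudo-labels line up with batches."""
        # 1) clustering step: features of the whole dataset -> k-means pseudo-labels (Eq. 2)
        convnet.eval()
        with torch.no_grad():
            feats = torch.cat([convnet(x) for x, _ in loader]).cpu().numpy()
        pseudo_labels = torch.as_tensor(KMeans(n_clusters=k).fit_predict(feats))

        # 2) training step: predict the pseudo-labels with a discriminative loss (Eq. 1)
        convnet.train()
        start = 0
        for x, _ in loader:
            y = pseudo_labels[start:start + x.size(0)]
            start += x.size(0)
            loss = F.cross_entropy(classifier(convnet(x)), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ```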

    Avoiding Trivial Solutions

    In this section, the authors briefly describe the causes of these trivial solutions and give simple and scalable workarounds.

    Empty clusters

    A discriminative model learns decision boundaries between classes. An optimal decision boundary is to assign all of the inputs to a single cluster (distance is minimized to zero by all the inputs being in the same class). This issue is caused by the absence of mechanisms to prevent the empty clusters. A common trick used in feature quantization consists in automatically reassigning empty clusters during the $k$-means optimization. More precisely, when a cluster becomes empty, we randomly select a non-empty cluster and use its centroid with a small random perturbation as the new centroid for the empty cluster. We then reassign the points belonging to the non-empty cluster to the two resulting clusters.
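
    A NumPy sketch of this empty-cluster trick (the perturbation scale is an assumption):

    ```python
    import numpy as np

    def fix_empty_clusters(centroids, assignments, eps=1e-4):
        """Give each empty cluster a slightly perturbed copy of a random non-empty centroid."""
        rng = np.random.default_rng()
        counts = np.bincount(assignments, minlength=centroids.shape[0])
        for empty in np.where(counts == 0)[0]:
            donor = rng.choice(np.where(counts > 0)[0])
            centroids[empty] = centroids[donor] + eps * rng.standard_normal(centroids.shape[1])
            # on the next assignment step, the donor cluster's points get split between the two centroids
        return centroids
    ```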

    Trivial parameterization

    If the vast majority of images are assigned to a few clusters, the parameters $\theta$ will exclusively discriminate between them. In the most dramatic scenario where all but one cluster are singleton, minimizing $Eq.(1)$ leads to a trivial parameterization where the convnet will predict the same output regardless of the input. This issue also arises in supervised classification when the number of images per class is highly unbalanced. A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels. This is equivalent to weight the contribution of an input to the loss function in $Eq.(1)$ by the inverse of the size of its assigned cluster.
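
    In PyTorch, this inverse-cluster-size weighting can be sketched with a WeightedRandomSampler (assuming `pseudo_labels` is a 1-D tensor of cluster assignments):

    ```python
    import torch
    from torch.utils.data import WeightedRandomSampler

    def uniform_over_clusters_sampler(pseudo_labels, num_samples):
        counts = torch.bincount(pseudo_labels)
        weights = 1.0 / counts[pseudo_labels].float()   # weight of each image = 1 / size of its cluster
        return WeightedRandomSampler(weights, num_samples, replacement=True)

    # usage: DataLoader(dataset, batch_size=256, sampler=uniform_over_clusters_sampler(pseudo_labels, len(dataset)))
    ```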

    M. Caron et al., 2021, Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (SwAV)

    SwAV takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, this method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, the authors use a "swapped" prediction mechanism where they predict the code (i.e., cluster assignments) of a view from the representation of another view.

    Introduction

    Many recent SOTA methods build upon the "instance discrimination task" that considers each image of the dataset (or "instance") and its transformations as a separate class (Z. Wu et al., 2018, "Unsupervised Feature Learning via Non-Parametric Instance Discrimination"). This task yields representations that are able to discriminate between different images, while achieving some invariance to image transformations. The recent self-supervised methods that use the instance discrimination rely on a combination of the following two elements: 1) a contrastive loss, 2) a set of image transformations. 

    The contrastive loss explicitly compares pairs of image representations to push away representations from different images while pulling together those from transformations, or views, of the same image. Since computing all the pairwise comparisons on a large dataset is not practical, most implementations approximate the loss by reducing the number of comparisons to random subsets of images during training (Z. Wu et al., 2018, "Unsupervised Feature Learning via Non-Parametric Instance Discrimination").

    An alternative to approximate the loss is to approximate the task - that is to relax the instance discrimination problem. For example, "clustering-based methods" discriminate between groups of images with similar features instead of individual images (M. Caron et al., 2018, "Deep Clustering for Unsupervised Learning of Visual Features"). The objective in clustering is tractable, but it does not scale well with the dataset as it requires a pass over the entire dataset to form image "codes" (i.e., cluster assignments) that are used as targets during training. 

    In this paper, the authors use a different paradigm and propose to compute the codes online while enforcing consistency between codes obtained from views of the same image. Comparing cluster assignments allows to contrast different image views while not relying on explicit pairwise feature comparisons. Specifically, the authors propose a simple "swapped" prediction problem where they predict the code of a view from the representation of another view. The representation features are learned by Swapping Assignments between multiple Views of the same image (SwAV). The feature and the codes are learned online allowing this method to scale to potentially unlimited amounts of data. In addition, It does not need any large memory bank.

    The authors also propose an image transformation, called "multi-crop". It uses smaller-sized images to increase the number of views while not increasing the memory or computational requirements during training. Directly working with downsized images introduces a bias of the features, which can be avoided by using a mix of different sizes.

    Related Work

    Instance and contrastive learning

    Instance-level classification considers each image in a dataset as its own class. In contrast to this line of work, the authors in this paper avoid comparing every pair of images by mapping the image features to a set of trainable prototype vectors.

    Clustering for deep representation learning

    This paper's work is related to clustering-based methods. [M. Caron, 2018, DeepCluster] shows that k-means assignments can be used as pseudo-labels to learn visual representations. In this paper, the authors keep the soft assignment produced by the Sinkhorn-Knopp algorithm instead of approximating it into a hard assignment. Besides, unlike [M. Caron, 2018], the authors obtain online assignments which allows the proposed method to scale to any dataset size.

    Method

    Typical clustering-based methods are offline in the sense that they alternate between a cluster assignment step where image features of the entire dataset are clustered, and a training step where the cluster assignments, i.e., "codes" are predicted for different image views. In the proposed method, the authors do not consider the codes as a target, but only enforce consistent mapping between views of the same image. The method can be interpreted as a way of contrasting between multiple image views by comparing their cluster assignments instead of their features.

    More precisely, we compute a code from an augmented version of the image and predict this code from other augmented versions of the same image. Given two image features $\boldsymbol{z}_t$ and $\boldsymbol{z}_s$ from two different augmentations of the same image, we compute their codes $\boldsymbol{q}_t$ and $\boldsymbol{q}_s$ by matching these features to a set of $K$ prototypes $\{ \boldsymbol{c}_1, \cdots, \boldsymbol{c}_K \}$. We then set up a "swapped" prediction problem with the following loss function:

    $$ L(\boldsymbol{z}_t, \boldsymbol{z}_s) = \ell(\boldsymbol{z}_t, \boldsymbol{q}_s) + \ell(\boldsymbol{z}_s, \boldsymbol{q}_t) \tag{1} $$

    where the function $\ell(\boldsymbol{z}, \boldsymbol{q})$ measures the fit between a feature $\boldsymbol{z}$ and a code $\boldsymbol{q}$. Intuitively, this method compares the features $\boldsymbol{z}_t$ and $\boldsymbol{z}_s$ using the intermediate codes $\boldsymbol{q}_t$ and $\boldsymbol{q}_s$. If these two features capture the same information, it should be possible to predict the code from the other feature.

    Online clustering

    Each image $\boldsymbol{x}_n$ is transformed into an augmented view $\boldsymbol{x}_{nt}$ by applying a transformation $t$ sampled from the set $\mathcal{T}$ of image transformations. The augmented view is mapped to a vector representation by applying a non-linear mapping $f_{\theta}$ to $\boldsymbol{x}_{nt}$. The feature is then projected to the unit sphere, i.e., $\boldsymbol{z}_{nt} = f_{\theta}(\boldsymbol{x}_{nt}) / || f_{\theta}(\boldsymbol{x}_{nt}) ||_2 $. We then compute a code $\boldsymbol{q}_{nt}$ from this feature by mapping $\boldsymbol{z}_{nt}$ to a set of $K$ trainable prototype vectors, $\{ \boldsymbol{c}_1, \cdots, \boldsymbol{c}_K \}$. We denote by $C$ the matrix whose columns are $\boldsymbol{c}_1, \cdots, \boldsymbol{c}_K$. We now describe how to compute these codes and update the prototypes online.

    Swapped prediction problem

    The loss function in $Eq.(1)$ has two terms that set up the "swapped" prediction problem of predicting the code $\boldsymbol{q}_t$ from the feature $\boldsymbol{z}_s$, and $\boldsymbol{q}_s$ from $\boldsymbol{z}_t$. Each term represents the cross-entropy loss between the code and the probability obtained by taking a softmax of the dot products of $\boldsymbol{z}_i$ and all prototypes in $C$, i.e.,

    where $\tau$ is a temperature parameter. It controls the concentration level of the distribution such that a larger temperature ($\tau > 1$) smoothes out the concentration, and a lower temperature ($\tau < 1$) intensifies the concentration. Taking this loss over all the images and pairs of data augmentations leads to the following loss function for the swapped prediction problem:

    This loss function is jointly minimized with respect to the prototypes $C$ and the parameters $\theta$ of the image encoder $f_{\theta}$ used to produce the features $(\boldsymbol{z}_{nt})_{n,t}$.
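
    A sketch of the swapped prediction loss for one pair of views (the Sinkhorn routine and its $\varepsilon$ and iteration count are assumptions following the general description above, not the paper's exact implementation):

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sinkhorn(scores, eps=0.05, n_iters=3):
        """Turn prototype scores (B, K) into soft codes with an equal-partition constraint."""
        q = torch.exp(scores / eps).t()               # (K, B)
        q /= q.sum()
        K, B = q.shape
        for _ in range(n_iters):
            q /= q.sum(dim=1, keepdim=True); q /= K   # normalize over prototypes
            q /= q.sum(dim=0, keepdim=True); q /= B   # normalize over samples
        return (q * B).t()                            # (B, K) codes

    def swav_loss(z_t, z_s, prototypes, tau=0.1):
        """L(z_t, z_s) = l(z_t, q_s) + l(z_s, q_t): cross-entropy between codes and prototype softmaxes."""
        z_t, z_s = F.normalize(z_t, dim=1), F.normalize(z_s, dim=1)
        c = F.normalize(prototypes, dim=1)            # (K, D): the prototype matrix C
        p_t = F.log_softmax(z_t @ c.t() / tau, dim=1)
        p_s = F.log_softmax(z_s @ c.t() / tau, dim=1)
        q_t, q_s = sinkhorn(z_t @ c.t()), sinkhorn(z_s @ c.t())
        return -(q_s * p_t).sum(dim=1).mean() - (q_t * p_s).sum(dim=1).mean()
    ```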

    Multi-crop: Augmenting views with smaller images

    We use two standard resolution crops and sample $V$ additional low-resolution crops that cover only small parts of the image. Using low-resolution images ensures only a small increase in the compute cost. Specifically, we generalize the loss of $Eq.(1)$:


    Handcrafted pretext tasks


    Exemplar, 2014

    Relative Patch Location, 2015

    LRN: Local Response Normalization (= a kind of regularization layer)

    Jigsaw Puzzles, 2016

    Autoencoder-Base Approaches

    Count, 2017

    Multi-task, 2017

    Combining several pretext tasks (= Relative patch location + Colorization + Exemplar + Motion Segmentation)

    Rotations, 2018

    K. He et al., 2019, "Momentum contrast for unsupervised visual representation learning" (MoCo)

    1. Introduction

    The "keys" in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded "query" should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.

    Momentum Contrast (MoCo) is a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss. We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.

    MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks. In this paper, we follow a simple instance discrimination task: a query matches a key if they are encoded views (e.g., different crops) of the same image.

    3. Method

    3.1. Contrastive Learning as Dictionary Look-up

    Contrastive learning, and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next.

    Consider an encoded query $q$ and a set of encoded samples $\{ k_0, k_1, k_2, \cdots \}$ that are the keys of a dictionary. Assume that there is a single key (denoted as $k_+$) in the dictionary that $q$ matches. A contrastive loss is a function whose value is low when $q$ is similar to its positive key $k_+$ and dissimilar to all other keys (considered negative keys for $q$). With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE (A. Oord et al., 2018, "Representation learning with contrastive predictive coding"), is considered in this paper:

    $$ \mathcal{L}_q = - \log \frac{ \mathrm{exp}(q \cdot k_+ / \tau) }{ \sum_{i=0}^K \mathrm{exp}( q \cdot k_i / \tau ) } $$

    where $\tau$ is a temperature hyper-parameter (In this paper, it's set to 0.07). It controls the concentration level of the distribution such that the larger temperature ($\tau > 1$) smoothes out the concentration, and the lower temperature ($\tau < 1$) intensifies the concentration. The sum is over one positive and $K$ negative samples. Intuitively, this loss is the log loss of a $(K+1)$-way softmax-based classifier that tries to classify $q$ as $k_+$ (i.e., log loss of softmax is equal to the cross entropy loss implemented in PyTorch). 

    The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys. In general, the query representation is $q = f_q (x^q)$ where $f_q$ is an encoder network and $x^q$ is a query sample (likewise, $k = f_k (x^k)$). Their instantiations depend on the specific pretext task. The inputs $x^q$ and $x^k$ can be images, patches, or context consisting of a set of patches. The networks $f_q$ and $f_k$ can be identical, partially shared, or different (in the case of MoCo, they are identical with some time delay).
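
    The remark that this log loss is just PyTorch's cross-entropy can be made concrete with a small sketch (q, k_pos of shape (B, D) and a queue of shape (K, D), all assumed L2-normalized):

    ```python
    import torch
    import torch.nn.functional as F

    def info_nce(q, k_pos, queue, tau=0.07):
        l_pos = (q * k_pos).sum(dim=1, keepdim=True)       # (B, 1) positive logits
        l_neg = q @ queue.t()                              # (B, K) negative logits against the queue
        logits = torch.cat([l_pos, l_neg], dim=1) / tau    # (B, 1+K)
        labels = torch.zeros(q.size(0), dtype=torch.long)  # the positive key is always at index 0
        return F.cross_entropy(logits, labels)             # (K+1)-way softmax log loss
    ```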

    3.2. Momentum Contrast

    Our hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution. Based on this motivation, the authors present Momentum Contrast.

    Dictionary as a queue

    At the core of our approach is maintaining the dictionary as a queue of data samples. This allows us to reuse the encoded keys from the immediately preceding mini-batches. The introduction of a queue decouples the dictionary size from the mini-batch size. Our dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter. The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed. The dictionary always represents a sampled subset of all data, while the extra computation of maintaining this dictionary is manageable. Moreover, removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated and thus the least consistent with the newest ones.

    Momentum update

    Formally, denoting the parameters of $f_k$ as $\theta_k$ and those of $f_q$ as $\theta_q$, we update $\theta_k$ by:

    $$ \theta_k \leftarrow m \theta_k + (1-m) \theta_q  \tag{2} $$

    Here $m \in [0,1)$ is a momentum coefficient. Only the parameters $\theta_q$ are updated by back-propagation. The momentum update in $Eq.(2)$ makes $\theta_k$ evolve more smoothly than $\theta_q$. As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small. In experiments, a relatively large momentum (e.g., $m = 0.999$, our default) works much better than smaller values (e.g., $m = 0.9$), suggesting that a slowly evolving key encoder is core to making use of a queue.

    Relation to previous mechanisms

    MoCo is a general mechanism for using contrastive losses. We compare it with two existing general mechanisms in the following figure. They exhibit different properties on the dictionary size and consistency:

    The end-to-end update by back-propagation is a natural mechanism (Fig. 2a). It uses samples in the current mini-batch as the dictionary, so the keys are consistently encoded (by the same set of encoder parameters). But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size. 

    Another mechanism is the memory bank approach proposed by [Z. Wu, 2018, "Unsupervised feature learning via non-parametric instance discrimination"]. A memory bank consists of the representations of all samples in the dataset. The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size. However, the representation of a sample in the memory bank was updated when it was last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch and thus are less consistent.

    3.3. Pretext Task

    Contrastive learning can drive a variety of pretext tasks. As the focus of this paper is not on designing a new pretext task, we use a simple one mainly following the instance discrimination task (Z. Wu, 2018, "Unsupervised feature learning via non-parametric instance discrimination").  

    We consider a query and a key as a positive pair if they originate from the same image, and otherwise as a negative sample pair. We take two random "views" of the same image under random data augmentation to form a positive pair. The queries and keys are respectively encoded by their encoders, $f_q$ and $f_k$. 

    Algorithm 1 provides the pseudo-code of MoCo for this pretext task. For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs. The negative samples are from the queue.
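
    A minimal training-step sketch in the spirit of Algorithm 1, reusing the `info_nce` function sketched above (the augmentation, encoders, and queue plumbing are assumptions, and shuffling BN is omitted):

    ```python
    import torch
    import torch.nn.functional as F

    def moco_step(f_q, f_k, queue, x, augment, optimizer, m=0.999, tau=0.07):
        x_q, x_k = augment(x), augment(x)                   # two random views of the same images
        q = F.normalize(f_q(x_q), dim=1)                    # queries: gradients flow through f_q only
        with torch.no_grad():                               # keys: no gradient
            for p_k, p_q in zip(f_k.parameters(), f_q.parameters()):
                p_k.data = m * p_k.data + (1.0 - m) * p_q.data   # momentum update of the key encoder (Eq. 2)
            k = F.normalize(f_k(x_k), dim=1)
        loss = info_nce(q, k, queue, tau)                   # contrastive loss against the queue
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        queue = torch.cat([k, queue])[: queue.size(0)]      # enqueue the newest keys, dequeue the oldest
        return loss.item(), queue
    ```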

    The pseudo-code (Algorithm 1) in the paper is easy to understand.

    Technical details

    ResNet is used as the encoder (feature extractor) and its last layer is a global average pooling layer. After that, it's followed by a fixed-dimensional output (128-D). This output vector is normalized by its L2-norm. 

    BN deteriorates the representation learning 

    In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [O. Henaff, "Data-efficient image recognition with contrastive predictive coding"] where they avoided using BN. The model appears to "cheat" the pretext task and easily finds a low-loss solution. This is possibly because the intra-batch communication among samples (caused by BN) leaks information. 

    In this paper, they solved the problem by shuffling the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding).

    4. Results

    Comparison under the linear classification protocol on ImageNet.

    T. Chen et al., 2020, "A simple framework for contrastive learning of visual representations" (SimCLR)

    SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. We systematically study the major components of the SimCLR framework. We show that

    1. Composition of data augmentations plays a critical role in defining effective predictive tasks that yield effective representations.
    2. Using the MLP projection head substantially improves the quality of the learned representations.
    3. Representation learning with contrastive cross-entropy loss (i.e., InfoNCE) benefits from L2-normalized embedding and an appropriately adjusted temperature parameter. 
    4. Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. (Yet, you have to be aware that SimCLR requires a large batch size to obtain a large number of negative pairs since it uses end-to-end learning without a memory bank. The large batch size is not really affordable, memory-wise, for "ordinary" devices.)

    2. Method

    2.1. The Contrastive Learning Framework

    This SimCLR framework comprises the following major components:

    1. we sequentially apply three simple augmentations: random cropping followed by resize back to the original size, random color distortions, and random Gaussian blur. 
    2. ResNet is adopted for the encoder $f(\cdot)$ to obtain $\boldsymbol{h}_i = f(\hat{ \boldsymbol{x} }_i )$, where $\boldsymbol{h}_i \in \mathbb{R}^d$ is the output after the average pooling layer.
    3. A two-layer MLP projection head $g(\cdot)$ is used that maps a 2048-dimensional feature to a 128-dimensional latent space.
    4. The same contrastive loss used in [Z. Wu, 2018, "Unsupervised feature learning via non-parametric instance discrimination"] and [A. Oord, 2018, "Representation learning with contrastive predictive coding"] is used.

    We randomly sample a minibatch of $N$ examples and define the contrastive prediction task on pairs of augmented examples derived from the minibatch, resulting in $2N$ data points. We do not sample negative examples explicitly. Instead, given a positive pair, we treat the other $2(N-1)$ augmented examples within a minibatch as negative examples. Let $\mathrm{sim}(\boldsymbol{u}, \boldsymbol{v}) = \boldsymbol{u}^{\mathrm{T}} \boldsymbol{v} / ||\boldsymbol{u}|| ||\boldsymbol{v}|| $ denote the dot product between $l_2$-normalized $\boldsymbol{u}$ and $\boldsymbol{v}$ (i.e., cosine similarity). Then, the loss function for a positive pair of examples $(i, j)$ is defined as

    $$ \ell_{i,j} = - \mathrm{log} \frac{ \mathrm{exp}( \mathrm{sim}( \boldsymbol{z}_i, \boldsymbol{z}_j ) / \tau ) }{ \sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \mathrm{exp}( \mathrm{sim}( \boldsymbol{z}_i, \boldsymbol{z}_k ) / \tau ) } $$

    where $\mathbb{1}_{[k \neq i]} \in \{ 0, 1 \}$ is an indicator function evaluating to 1 if $k \neq i$ and $\tau$ denotes a temperature parameter. The final loss is computed across all positive pairs, both $(i,j)$ and $(j,i)$, in a mini-batch. 

    Note that the loss is symmetric between the two views.
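
    A sketch of this NT-Xent loss for a batch of $2N$ projections (ordering assumption: rows $i$ and $i+N$ of `z` are the two views of the same image):

    ```python
    import torch
    import torch.nn.functional as F

    def nt_xent(z, tau=0.1):
        """z: (2N, D) projection-head outputs; rows i and i+N come from the same image."""
        n = z.size(0) // 2
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau                       # (2N, 2N) cosine similarities / tau
        sim.fill_diagonal_(float('-inf'))           # the indicator 1_[k != i]: drop self-similarity
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # positive of i is i+N, and vice versa
        return F.cross_entropy(sim, targets)        # averaged over both (i, j) and (j, i)
    ```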

    2.2. Training with Large Batch Size

    To keep it simple, we do not train the model with a memory bank. Instead, we use a large batch size. To support the large batch size, the authors used Cloud TPUs, using 32 to 128 cores depending on the batch size. SimCLR is not really efficient in terms of memory.

    Global BN

    In distributed training with data parallelism, the BN mean and variance are typically aggregated locally per device. However, the BN tends to deteriorate the quality of learned representations. We address this issue by aggregating BN mean and variance over all devices during the training. Other approaches include shuffling data examples across devices (MoCo), or replacing BN with layer norm (Henaff et al., 2019).

    Default setting

    We use learning rate warmup (linear warmup) for the first 10 epochs, and decay the learning rate with the cosine decay schedule without restarts.

    Results

    (Right) Use of the MLP projection head matters.
    $\tau$ seems to result in a good performance when it's 0.1 or 0.2 (considering the SimCLR or MoCo-v2 papers). Note that Entropy denotes data impurity.
    A large batch size results in better performance (i.e., more negative pairs). Yet, it's not really affordable memory-wise.

    X. Chen et al., 2020, "Improved baselines with momentum contrastive learning" (MoCo v2)

    Introduction

    The authors adopt two of SimCLR's design improvements to upgrade MoCo-v1 to MoCo-v2: (1) MLP projection head, (2) Stronger data augmentation.

    The MoCo framework can process a large set of negative samples without requiring large training batches as shown in the figure below. In contrast to SimCLR's large 4k~8k batches which require TPU support, the "MoCo-v2" can run on a typical 8-GPU machine and achieve better results than SimCLR. 

    SimCLR follows Fig. 1(a);   MoCo follows Fig.1(b).

    Results

    The MLP projection head improves the performance a lot. The longer training epochs should be noted too. Yet, the cosine learning rate schedule is not much of a help.
    The 4k batch size is intractable even on a high-end 8-GPU machine. Also, under the same batch size of 256, the end-to-end variant is still more costly in memory and time, because it back-propagates to both the $q$ and $k$ encoders, while MoCo back-propagates to the $q$ encoder only.

    J. Grill et al., 2020, "Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning" (BYOL)

    BYOL relies on two neural networks, referred to as online and target networks. From an augmented view of an image, we train the online network to predict the target network's representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While SOTA methods rely on negative pairs, BYOL achieves a new SOTA without them.

    1. Introduction

    The contrastive learning-based approaches with negative pairs need careful treatment of those pairs, relying on either large batch sizes or memory banks.

    BYOL achieves higher performance than SOTA contrastive methods without using negative pairs. It iteratively bootstraps (the term bootstrap is used here in its idiomatic sense rather than the statistical sense) the outputs of a network to serve as targets for an enhanced representation. Moreover, BYOL is more robust to the choice of image augmentations than contrastive methods. We suspect that not relying on negative pairs is one of the leading reasons for its improved robustness. Starting from an augmented view of an image, BYOL trains its online network to predict the target network's representation of another augmented view of the same image. While this objective admits collapsed solutions, e.g., outputting the same vector for all images, we empirically show that BYOL does not converge to such solutions.

    We show that BYOL is more resilient to changes in the batch size and in the set of image augmentations compared to its contrastive counterparts.

    2. Related work

    The idea of a slow-moving average target network to produce stable targets for the online network was inspired by deep RL such as DQN.

    Difference between MoCo and BYOL w.r.t. the slow-moving average network: In self-supervised learning, MoCo uses a slow-moving average network (momentum encoder) to maintain consistent representations of negative pairs drawn from a memory bank. Instead, BYOL uses a moving average network to produce prediction targets as a means of stabilizing the bootstrap step.

    3. Method

    Many successful self-supervised learning approaches build upon the cross-view prediction framework. Typically, these approaches learn representations by predicting different views (e.g. different random crops) of the same image from one another. Many such approaches cast the prediction problem directly in representation space. However, predicting directly in representation space can lead to collapsed representations. Contrastive methods circumvent this problem by reformulating the prediction problem into one of discrimination: from the representation of an augmented view, they learn to discriminate between the representation of another augmented view of the same image, and the representations of augmented views of different images. Yet, this discriminative approach typically requires comparing each representation of an augmented view with many negative examples, to find ones sufficiently close to make the discrimination task challenging.

    In case of BYOL: from a given representation, referred to as target, we can train a new, potentially enhanced representation, referred to as online, by predicting the target representation. From there, we can expect to build a sequence of representations of increasing quality by iterating this procedure, using subsequent online networks as new target networks for further training.

    3.1. Description of BYOL

    BYOL's goal is to learn a representation $y_{\theta}$ which can then be used for downstream tasks. The online network is defined by a set of weights $\theta$ and is comprised of three stages: an encoder $f_{\theta}$, a projector $g_{\theta}$, and a predictor $q_{\theta}$. The target network has the same architecture as the online network, but uses a different set of weights $\xi$. The target network provides the regression targets to train the online network, and its parameters $\xi$ are an exponential moving average of the online parameters $\theta$. More precisely, given a target decay rate $\tau \in [0,1]$, after each training step, we perform the following update,

    $$ \xi \leftarrow \tau \xi + (1 - \tau) \theta $$

    We $l_2$-normalize $q_{\theta}(z_{\theta})$ and $z_{\xi}^{\prime}$ to yield $\bar{q}_{\theta}(z_{\theta})$ and $\bar{z}_{\xi}^{\prime}$. Note that the predictor is only applied to the online branch, making the architecture asymmetric between the online and target pipelines. Finally, we define the following mean squared error between the normalized predictions and target projections:

    $$ \mathcal{L}_{\theta, \xi} = \left\Vert \bar{q}_{\theta}(z_{\theta}) - \bar{z}_{\xi}^{\prime} \right\Vert_2^2 = 2 - 2 \cdot \frac{ \langle q_{\theta}(z_{\theta}), z_{\xi}^{\prime} \rangle }{ \Vert q_{\theta}(z_{\theta}) \Vert_2 \, \Vert z_{\xi}^{\prime} \Vert_2 } $$

    We symmetrize the loss $\mathcal{L}_{\theta, \xi}$ by separately feeding $v^\prime$ to the online network and $v$ to the target network to compute $\tilde{\mathcal{L}}_{\theta, \xi}$. At each training step, we perform a stochastic optimization step to minimize $\mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}  =  \mathcal{L}_{\theta, \xi} + \tilde{\mathcal{L}}_{\theta, \xi} $ with respect to $\theta$ only, but not $\xi$, as depicted by the stop-gradient (sg) in Figure 2. BYOL's dynamics are summarized as:

    $$ \theta \leftarrow \mathrm{optimizer}(\theta, \nabla_{\theta} \mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}, \eta), \qquad \xi \leftarrow \tau \xi + (1 - \tau) \theta $$

    where $\eta$ is a learning rate.

    At the end of training, we only keep the encoder $f_{\theta}$, as in the MoCo paper.
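
    A sketch of one BYOL update for a pair of views (here `online` and `target` stand for the composed encoder-plus-projector networks and `predictor` for $q_\theta$; the names and optimizer wiring are assumptions):

    ```python
    import torch
    import torch.nn.functional as F

    def byol_step(online, target, predictor, v, v_prime, optimizer, tau=0.996):
        def regression_loss(p, z):
            p, z = F.normalize(p, dim=1), F.normalize(z, dim=1)
            return (2 - 2 * (p * z).sum(dim=1)).mean()        # MSE between normalized vectors

        with torch.no_grad():                                 # stop-gradient: targets are treated as constants
            t1, t2 = target(v), target(v_prime)
        loss = regression_loss(predictor(online(v)), t2) \
             + regression_loss(predictor(online(v_prime)), t1)   # symmetrized loss
        optimizer.zero_grad(); loss.backward(); optimizer.step()  # update theta only

        with torch.no_grad():                                 # EMA update of the target parameters xi
            for p_t, p_o in zip(target.parameters(), online.parameters()):
                p_t.data = tau * p_t.data + (1 - tau) * p_o.data
        return loss.item()
    ```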

    3.2 Intuitions on BYOL's behavior

    As BYOL does not use an explicit term to prevent collapse (such as negative pairs) while minimizing $\mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}$ with respect to $\theta$, it may seem that BYOL should converge to a minimum of this loss with respect to $(\theta, \xi)$ (e.g., a collapsed constant representation). However, BYOL's target parameters $\xi$ are not updated in the direction of $\nabla_\xi \mathcal{L}_{\theta, \xi}^{\mathrm{BYOL}}$. More generally, we hypothesize that there is no loss $L_{\theta, \xi}$ such that BYOL's dynamics is a gradient descent on $L$ jointly over $\theta, \xi$. This way, the two representations are not forced to be the same (=> no worries about the collapsed constant representation).

    When assuming BYOL's predictor $q_\theta$ to be optimal, i.e., $q_\theta = q^*$, with

    3.3 Implementation details

    Architecture

    The representation $y$ corresponds to the output of the final average pooling layer, which has a feature dimension of 2048. The representation $y$ is projected to a smaller space by an MLP $g_\theta$, and similarly for the target projection $g_\xi$. This MLP consists of a linear layer with output size 4096 followed by BN and ReLU, and a final linear layer with output dimension 256.

    4. Results

    (Left) It shows that BYOL is robust in terms of batch size, unlike SimCLR. (Right) BYOL is less affected by the image augmentations.
    Ablation at 300 epochs under linear evaluation on ImageNet. The target decay rate should be appropriately set.

    Weight decay: We note that removing the weight decay in either BYOL or SimCLR leads to network divergence, emphasizing the need for weight regularization in a self-supervised setting. In the SimCLR and BYOL frameworks, the weight decay of $10^{-6}$ is used.


    H. Liu et al., 2020, "FROST: Faster and more robust one-shot semi-supervised training"

    Two previous SOTA methods that have been successful for one-shot semi-supervised learning

    FixMatch and BOSS 

    FixMatch was the first paper to present results for one-shot semi-supervised learning, which was followed up by the BOSS method.

    Other semi-supervised methods

    Other semi-supervised methods such as the $\pi$ model, temporal ensembling, VAT, ICT, UDA, $S^4L$, Mix-Match, and Mean Teachers could not demonstrate results with less than 25 labeled examples per class.

    Semi-supervised learning

    Semi-supervised learning uses a small amount of labeled data to define the desired task and uses a large amount of unlabeled data to avoid overfitting the labeled data.

    Consistency regularization (= consistency training) vs. Contrastive learning

    It utilizes unlabeled data by relying on the assumption that the model should output the same final predictions when fed perturbed versions of an image as when fed the original image. In contrastive representation learning, the assumption is that the model's internal representations should be similar for perturbed versions of the same image and less similar to the representations of other images.

    BOSS - Class balancing 

    BOSS borrowed techniques from the literature on training with imbalanced data (i.e., some classes having many more training samples than other classes). An important difference between semi-supervised learning and the imbalanced-data setting is the lack of ground truth as to which are the majority and minority classes when applying these techniques to unlabeled data. The authors of BOSS proposed using the model's pseudo-labels as a means to count each class and implicitly assumed that the unlabeled dataset is class balanced.

    Class balancing methods are typically one of two types: 1) data-level, 2) algorithm-level. The data-level methods oversample the minority classes and undersample the majority classes. The algorithm-level methods improve the imbalance by using larger weights in the loss function for training samples that belong to the minority classes, as well as smaller weights for the samples from the majority classes. 

    Then, the count of the set of pseudo-labels was used to conduct the class balancing. BOSS includes data-level, algorithm-level, and hybrid methods.

    Pseudo-labeling

    It defines a base model that is trained on labeled data and the model is used to predict labels for unlabeled data.

    Self-training

    It is a basic approach of pseudo-labeling where a classifier is trained with an initially small number of labeled examples, aiming to classify unlabeled points, and then it is retrained with its own most confident predictions, thus enlarging its labeled training set.

    How is FROST different from the previous semi-supervised learning methods?

    FROST differs from the previous methods by utilizing a semi-supervised base learner in place of the supervised base learner.

    Loss function in FROST

    A weighted sum of a supervised loss and an unsupervised loss is used, in which the unsupervised loss term is for consistency regularization.

    $$ J = \lambda_s J_s + \lambda_u J_u $$

    where the subscripts $s$ and $u$ denote supervised and unsupervised, respectively. $\lambda$ and $J$ denote a scalar hyperparameter and loss, respectively.
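
    A sketch of this weighted objective, with cross-entropy as the supervised term and a FixMatch-style pseudo-label consistency term as the unsupervised part (the confidence threshold and the weak/strong augmentation split are assumptions for illustration):

    ```python
    import torch
    import torch.nn.functional as F

    def frost_loss(model, x_l, y_l, x_weak, x_strong, lambda_s=1.0, lambda_u=1.0, threshold=0.95):
        j_s = F.cross_entropy(model(x_l), y_l)                       # supervised loss on the few labeled examples
        with torch.no_grad():
            probs = F.softmax(model(x_weak), dim=1)                  # predictions on weakly augmented unlabeled data
            conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()                           # keep only confident pseudo-labels
        j_u = (F.cross_entropy(model(x_strong), pseudo, reduction='none') * mask).mean()
        return lambda_s * j_s + lambda_u * j_u                       # J = lambda_s * J_s + lambda_u * J_u
    ```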

    Xinlei Chen and Kaiming He, 2020, Exploring Simple Siamese Representation Learning (SimSiam)

    Collapsed representation

    It refers to the phenomenon that all representations collapse into a constant. To solve this problem, contrastive learning was initially developed.

    Application of Siamese networks

    The applications include signature and face verification, tracking, one-shot learning, and others.

    Negative samples in contrastive learning

    In practice, contrastive learning methods benefit from a large number of negative samples. These samples can be maintained in a memory bank. In a Siamese network, MoCo [Kaiming He et al., 2019, Momentum Contrast for Unsupervised Visual Representation Learning] maintains a queue of negative samples and turns one branch into a momentum encoder to improve consistency of the queue. SimCLR directly uses negative samples coexisting in the current batch, and it requires a large batch size to work well.

    Common experimental setup protocol for unsupervised learning

    We do unsupervised pre-training on the (1000-class ImageNet) training set without using labels. The quality of the pre-trained representations is evaluated by training a supervised linear classifier on frozen representations from the training set, and then testing it on the validation set, which is the common protocol.

    kNN classifier serving as a monitor of the progress

    Originally, this idea is from the memory bank paper:

    In the ImageNet dataset, all the images have labels; for unsupervised learning, we simply train the model without using them. The above figure can be created by finding the $k$ training representations nearest to a test image's representation and computing:

    $$ \frac{\#\,\text{matched labels}}{k} $$

    where $k$ is a number of neighbors considered.
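
    A sketch of such a kNN monitor (the feature bank and its labels come from the training set; cosine similarity and majority voting are common choices, assumed here):

    ```python
    import torch
    import torch.nn.functional as F

    def knn_accuracy(test_feats, test_labels, bank_feats, bank_labels, k=200):
        test_feats = F.normalize(test_feats, dim=1)
        bank_feats = F.normalize(bank_feats, dim=1)
        sim = test_feats @ bank_feats.t()           # cosine similarity to every training feature
        _, nn_idx = sim.topk(k, dim=1)              # indices of the k nearest neighbors
        nn_labels = bank_labels[nn_idx]             # (num_test, k) labels of those neighbors
        pred = nn_labels.mode(dim=1).values         # majority vote over the k neighbors
        return (pred == test_labels).float().mean().item()
    ```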

    Categories of unsupervised learning for the representation learning

    Unsupervised learning for representation learning is categorized into 1) Generative, 2) Discriminative.

    Generative approaches to representation learning build a distribution over data and latent embedding and use the learned embeddings as image representations. Many of these approaches rely either on auto-encoding of images or on adversarial learning. The generative methods typically operate directly in pixel space. This, however, is computationally expensive, and the high level of detail required for image generation may not be necessary for representation learning.

    Among the discriminative methods, contrastive methods currently achieve SOTA performance in self-supervised learning. The contrastive approaches avoid a costly generation step in pixel space by bringing representation of different views of the same image closer ('positive pairs'), and spreading representations of views from different images ('negative pairs') apart. The contrastive methods often require comparing each example with many other examples to work well. 

    Some self-supervised methods are not contrastive but rely on using auxiliary handcrafted prediction tasks to learn their representation (i.e., pretext tasks). In particular, relative patch prediction, colorizing gray-scale images, image inpainting, image jigsaw puzzles, image super-resolution, and geometric transformations have been shown to be useful.

    Transfer to other classification tasks

    In the evaluation step, the authors evaluated the representation on other classification datasets to assess whether the features learned on ImageNet are generic and thus useful across image domains. They performed the linear evaluation and fine-tuning on the same set of classification tasks. The resulting table from the paper is shown in the following figure:

     

     
