  • [2021.04.12] Barlow Twins
    Paper review 2021. 4. 15. 11:06

    J. Zbontar et al., 2021, "Barlow Twins: Self-Supervised Learning via Redundancy Reduction"

    This paper proposes an objective function that naturally avoids the collapse to trivial constant representations by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.

    This does not require large batches, nor does it require any asymmetric momentum encoders or stop-gradients. Intriguingly, Barlow Twins (BT) strongly benefits from the use of very high-dimensional representations.

    BT distinguishes itself from other methods by its innovative loss function $\mathcal{L}_{\mathcal{BT}}$:

    $$\mathcal{L}_{\mathcal{BT}} = \underbrace{\sum_i \left(1 - \mathcal{C}_{ii}\right)^2}_{\text{invariance term}} \;+\; \lambda \underbrace{\sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2}_{\text{redundancy reduction term}}$$

    where $\lambda$ is a positive constant trading off the importance of the first and second terms of the loss, and $\mathcal{C}$ is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

    $$\mathcal{C} = \frac{1}{N} (Z^A)^\mathrm{T} Z^B$$

    $Z^A \in \mathbb{R}^{N \times D}$ and $Z^B \in \mathbb{R}^{N \times D}$ are the embeddings of two distorted views of the same batch of $N$ samples, and each feature is assumed to be mean-centered with a standard deviation of 1 along the batch dimension.

    Intuitively, the invariance term of the objective makes the representation invariant to the distortions applied. The redundancy reduction term decorrelates the different vector components of the representation.
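
    To make the two terms concrete, here is a minimal PyTorch sketch of the loss. It is my own illustration rather than the authors' code: the function name `barlow_twins_loss` and the small epsilon in the standardization are assumptions, and the $1/N$ factor follows from the batch-standardized embeddings described above.

    ```python
    import torch

    def barlow_twins_loss(z_a, z_b, lam=5e-3):
        """z_a, z_b: (N, D) projector outputs for two distorted views of the same batch."""
        N = z_a.shape[0]
        # Standardize each feature along the batch dimension (zero mean, unit std)
        z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + 1e-6)
        z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + 1e-6)
        # D x D cross-correlation matrix between the two views
        c = (z_a.T @ z_b) / N
        # Invariance term: push diagonal entries toward 1
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()
        # Redundancy reduction term: push off-diagonal entries toward 0
        off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()
        return on_diag + lam * off_diag
    ```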

    Architecture: The encoder consists of a ResNet-50, followed by a projector network. The projector network has three linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and ReLU.
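
    A minimal PyTorch sketch of this architecture, assuming the standard torchvision ResNet-50 as the backbone; the class name `BarlowTwinsNet` and the bias-free linear layers before batch norm are my own choices, not taken from the paper.

    ```python
    import torch.nn as nn
    from torchvision.models import resnet50

    class BarlowTwinsNet(nn.Module):
        def __init__(self, proj_dim=8192):
            super().__init__()
            # ResNet-50 encoder with its classification head replaced by an identity
            self.backbone = resnet50()
            feat_dim = self.backbone.fc.in_features  # 2048 for ResNet-50
            self.backbone.fc = nn.Identity()
            # Projector: three linear layers with 8192 output units each;
            # the first two are followed by batch normalization and ReLU
            self.projector = nn.Sequential(
                nn.Linear(feat_dim, proj_dim, bias=False),
                nn.BatchNorm1d(proj_dim),
                nn.ReLU(inplace=True),
                nn.Linear(proj_dim, proj_dim, bias=False),
                nn.BatchNorm1d(proj_dim),
                nn.ReLU(inplace=True),
                nn.Linear(proj_dim, proj_dim),
            )

        def forward(self, x):
            return self.projector(self.backbone(x))
    ```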

    Optimization: Batch size: 2048 (yet it works well even with batches as small as 256); $\lambda$: 5e-3; Weight decay: 1.5e-6.
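
    Putting the pieces together, a hypothetical training step might look like the sketch below. It reuses `BarlowTwinsNet` and `barlow_twins_loss` from the sketches above; the paper actually trains with the LARS optimizer, so the plain SGD here (and its learning rate) is only an illustrative stand-in.

    ```python
    import torch

    model = BarlowTwinsNet()
    # SGD as a stand-in for LARS; the learning rate is a placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                                momentum=0.9, weight_decay=1.5e-6)

    def training_step(x_a, x_b):
        """x_a, x_b: two distorted (augmented) views of the same batch of images."""
        z_a, z_b = model(x_a), model(x_b)
        loss = barlow_twins_loss(z_a, z_b, lam=5e-3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```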

    Results


    (Left) 'Normalization along feature dim.' refers to the common l2-normalization practice along the feature dimension. (Right) BT is robust w.r.t. batch size, while SimCLR requires a large batch size to obtain enough negative samples.


    (Left) BT's performance increases with the dimensionality of the projector's output. (Right) BT doesn't really need a 'stop-gradient' or a 'predictor' at all.


    Comments