Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko, NeurIPS 2020


Learning good image representations is a key challenge in computer vision. To date, most state-of-the-art methods for learning image representations have been contrastive: they are trained by manipulating the distance between representations of augmented views of images, pulling together views of the same image (positive pairs) and pushing apart views of different images (negative pairs). This means they require careful handling of negative pairs, typically via large batch sizes or memory banks, and their performance depends critically on the choice of image augmentations. A major practical shortcoming of contrastive methods is therefore the sheer number of examples they must process during training.

To remove the need for negative pairs, the authors introduce Bootstrap Your Own Latent (BYOL), a new algorithm for self-supervised learning of image representations. BYOL iteratively bootstraps the outputs of one network (the target network) to serve as targets for an enhanced representation. Concretely, BYOL sets up a collaboration between two neural networks: an online network and a target network. Given an augmented view of an image, the online network is trained to predict the target network's representation of another augmented view of the same image. The two networks are thus trained hand in hand: the target network's parameters are an exponential moving average of the online network's parameters, which serves to stabilize the bootstrapping step.
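The core loop can be sketched in a few lines. The snippet below is a toy NumPy illustration, not the paper's implementation: single linear maps stand in for the encoder, projector, and predictor, and random noise stands in for image augmentations. It shows the two ingredients described above: the online branch predicting the target branch's projection (with a normalized-MSE loss, equivalent to 2 − 2·cosine similarity), and the exponential-moving-average update of the target parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks (hypothetical shapes):
# a single linear map replaces encoder+projector; the online
# branch additionally has a predictor on top.
online_params = rng.normal(size=(8, 8))
predictor_params = rng.normal(size=(8, 8))
target_params = online_params.copy()  # target initialized from online

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def byol_loss(online_pred, target_proj):
    # Normalized MSE between the two branches,
    # equal to 2 - 2 * cosine_similarity per example.
    p, z = l2_normalize(online_pred), l2_normalize(target_proj)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

# Two "augmented views" of the same batch (noise as a stand-in
# for the paper's image augmentations).
x = rng.normal(size=(4, 8))
view1 = x + 0.1 * rng.normal(size=x.shape)
view2 = x + 0.1 * rng.normal(size=x.shape)

# Online branch predicts the target branch's projection of the other view.
prediction = (view1 @ online_params) @ predictor_params
target_projection = view2 @ target_params  # no gradient flows through this in BYOL

loss = byol_loss(prediction, target_projection)

# Target update: exponential moving average of the online parameters.
tau = 0.99  # decay rate; the paper anneals this toward 1
target_params = tau * target_params + (1 - tau) * online_params
```

In the full method the loss is also symmetrized (each view takes a turn as the online input), and only the online network and predictor receive gradients; the target network changes solely through the moving-average update.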

BYOL's performance on ImageNet, especially under the linear evaluation protocol, is comparable to previous SOTA methods, and with wider and deeper architectures it even outperforms them while using roughly 30% fewer parameters.


Main Contributions

Our two cents