Sindy Löwe, Peter O'Connor, Bastiaan S. Veeling, NeurIPS 2019
The paper proposes a novel self-supervised learning technique which, rather than relying on end-to-end training, trains the individual modules of a network in isolation using a greedy, module-local InfoNCE objective. The approach draws heavy inspiration from biological phenomena. The paper won an Honorable Mention for the Outstanding New Directions Paper Award at NeurIPS 2019.
The main contribution of this paper is the greedy InfoNCE objective used to train the model module by module rather than end-to-end. This can substantially reduce computation time and overcome the memory bottleneck of full backpropagation, since each module only needs to store its own activations and gradients. The model learns in a self-supervised manner, using the mutual information between representations of nearby patches as the training signal.
The technique is inspired by principles from neuroscience: rather than optimizing a single global objective function, the brain appears to operate in a modular fashion, with each region optimizing local information.
The self-supervised character of the model comes from maximizing the mutual information between representations of temporally nearby data. This works because of the presence of slow features in the data, i.e. features that vary slowly over time, which are highly effective for downstream tasks.
The work builds heavily on Contrastive Predictive Coding (CPC), introduced in Oord et al. The basic principle behind CPC is maximizing the mutual information between representations of temporally nearby patches.
Elaborating on CPC: a summary $c_t$ of the encodings up to time $t$ is computed by running an autoregressive model over the representations of the input up to time $t$. This $c_t$ is then used to compute a mutual-information-based loss against future input representations $z_{t+k}$. Rather than estimating this directly, a form of negative sampling is used: a bag of representations $\{z_{t+k}, z_{j_1}, \dots, z_{j_{N-1}}\}$ is drawn, containing one positive sample and $N-1$ 'negative samples', and the model must identify the positive among them.
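To make the negative-sampling objective concrete, here is a minimal PyTorch sketch of an InfoNCE-style loss; the function name, the tensor shapes, and the learned prediction matrix `W_k` are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(c_t, z_pos, z_negs, W_k):
    """InfoNCE: identify the true future representation z_pos among
    N-1 negatives, given the context c_t (all names illustrative).

    c_t:    (B, D)       context at time t
    z_pos:  (B, D)       positive sample z_{t+k}
    z_negs: (B, N-1, D)  negative samples
    W_k:    (D, D)       learned prediction matrix for step k
    """
    pred = c_t @ W_k                                         # (B, D)
    pos = (pred * z_pos).sum(-1, keepdim=True)               # (B, 1)
    neg = torch.bmm(z_negs, pred.unsqueeze(-1)).squeeze(-1)  # (B, N-1)
    logits = torch.cat([pos, neg], dim=1)                    # positive at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits, labels)
```

This reduces the mutual-information estimation to a cross-entropy classification problem: the loss is low when the model scores the true future representation above all negatives.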
Following this idea, the authors propose Greedy InfoMax, which greedily trains the separate modules of the network. Representations are extracted from module $m-1$ and passed on to module $m$ as $z_t^m = \mathrm{GradientBlock}(\mathrm{enc}^m(z_t^{m-1}))$. The GradientBlock prevents gradients from flowing backward between modules.
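In PyTorch, the gradient-blocking step reduces to `detach()`; a minimal sketch (the function name `gradient_block` is mine):

```python
import torch

def gradient_block(x: torch.Tensor) -> torch.Tensor:
    # detach() keeps the values but cuts the backward graph, so no
    # gradient can flow from module m back into module m-1
    return x.detach()

# Hypothetical module boundary:
# z_m = module_m(gradient_block(z_m_minus_1))
```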
To train each module separately, the CPC strategy is adopted directly, with one modification: rather than using an autoregressive model to summarize information up to time $t$, the encoding at time $t$ is used directly to compute the mutual information with temporally nearby patches, which the authors found to work just as well in practice.
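Putting the pieces together, here is a minimal sketch of the greedy, module-wise training loop, reusing the `info_nce_loss` sketch above; the toy modules, dimensions, and in-batch negative sampling are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

dim, k = 64, 4                          # feature dim, prediction step (toy values)
modules = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
W_ks = nn.ParameterList([nn.Parameter(0.01 * torch.randn(dim, dim))
                         for _ in modules])
optims = [torch.optim.Adam(list(m.parameters()) + [w], lr=1e-4)
          for m, w in zip(modules, W_ks)]

x = torch.randn(8, 32, dim)             # (batch, time, dim) toy input
for m, w, opt in zip(modules, W_ks, optims):
    z = m(x)                            # (batch, time, dim)
    # local objective: predict z[:, t+k] directly from z[:, t]
    # (no autoregressive summary, per the simplification above)
    flat_c = z[:, :-k].reshape(-1, dim)
    flat_p = z[:, k:].reshape(-1, dim)
    # crude in-batch negatives: shuffled positives (illustrative only)
    negs = torch.stack([flat_p.roll(s, dims=0) for s in (1, 2, 3)], dim=1)
    loss = info_nce_loss(flat_c, flat_p, negs, w)
    opt.zero_grad()
    loss.backward()                     # reaches only this module and w
    opt.step()
    x = z.detach()                      # GradientBlock: isolate next module
```

Because each module's loss is computed on gradient-blocked inputs, backpropagation never spans more than one module, which is where the memory savings over end-to-end training come from.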