Learning Video Representations from Large Language Models

Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar. Meta AI (Facebook AI Research), University of Texas at Austin, 2022

Summary

The approach proposed in this paper, LaViLa (Language-model augmented Video-Language), repurposes pre-trained Large Language Models (LLMs) to be conditioned on visual input and fine-tunes them into automatic narrators that generate dense textual descriptions of videos. These generated narrations are then used to contrastively train a dual video-text encoder (sketched below), which outperforms existing video-language models.
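For intuition, here is a minimal sketch of the second stage: contrastive training of a dual video-text encoder on the LLM-generated narrations. It uses the standard symmetric InfoNCE objective over a batch of paired embeddings; the tensor names are hypothetical and this is not LaViLa's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a dual video-text encoder.

    video_emb, text_emb: (B, D) embeddings where pairs at the same
    batch index match. A sketch of the standard contrastive objective;
    LaViLa's exact loss and temperature schedule may differ.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix; matched pairs sit on the diagonal,
    # every other pair in the batch acts as a negative.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)        # video -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)    # text -> video
    return 0.5 * (loss_v2t + loss_t2v)
```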

Contributions

Method


Results

LaViLa outperforms the previous state-of-the-art video-language pretraining methods on several benchmarks, including EK100 MIR, Charades-Ego, and EGTEA.
Evaluation is carried out under several protocols (e.g., multi-instance retrieval and action recognition), and the approach outperforms the previous SOTA in all of them.
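As an illustration of one such protocol, below is a hedged sketch of text-to-video retrieval evaluated with Recall@k, computed from precomputed embedding matrices (the function name and inputs are hypothetical); benchmarks like EK100 MIR also report mAP/nDCG variants, which are not shown here.

```python
import torch

def recall_at_k(text_emb, video_emb, k=5):
    """Text-to-video Recall@k: fraction of text queries whose matching
    video (same index) appears among the k most similar videos.
    Assumes L2-normalized (N, D) embeddings with aligned indices."""
    sims = text_emb @ video_emb.t()                    # (N, N) similarities
    topk = sims.topk(k, dim=-1).indices                # top-k videos per query
    targets = torch.arange(text_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```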

Two-Cents

Resources

Paper: https://arxiv.org/abs/2212.04501