C. Sun et al.

Google Research

Presented by: Nikhil Devraj



  • Representations of video data generally capture only low-level features and not semantic data
  [Figure: low-level features]
  • BERT achieves strong results on language modeling tasks


  • Combines automatic speech recognition (ASR), vector quantization, and BERT to learn high-level features over long time spans in video
  • A first step in the direction of learning high-level joint representations
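To make the vector quantization step concrete, here is a minimal sketch of how continuous clip features could be turned into discrete "visual tokens" that a BERT-style model can consume. It assumes a codebook of centroids already exists (for example from a clustering step such as k-means; the slides do not say how it is built). The names `nearest_centroid`, `quantize`, and the toy 2-D data are illustrative, not taken from the paper.

```python
import math

def nearest_centroid(feature, centroids):
    """Return the index of the centroid closest to `feature` (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for i, centroid in enumerate(centroids):
        d = math.dist(feature, centroid)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

def quantize(features, centroids):
    """Map each continuous feature vector to a discrete token id."""
    return [nearest_centroid(f, centroids) for f in features]

# Toy 2-D codebook (assumption); real clip features would be high-dimensional.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
clip_features = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8)]

visual_tokens = quantize(clip_features, centroids)
print(visual_tokens)  # [0, 1, 2, 1]
```

Once video is expressed as a sequence of token ids like this, it can be treated much like a sentence of words.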

VideoBERT Flow



  • BERT: a pretrained language model used to generate a probability distribution over tokens
  • Obtained by training the model on a “masking” task: hide some tokens and predict them from context
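The two bullets above can be sketched roughly as follows: hide some tokens, then have a model produce a probability distribution over the vocabulary at each hidden position. This is a simplified illustration, not BERT's exact recipe (the published procedure also sometimes keeps or randomizes a selected token). The function names, the 0.15 masking rate, and the example scores are assumptions made for the sketch.

```python
import math
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK with probability `mask_prob`.

    Returns the masked sequence and a dict {position: original token}
    holding the targets a model would be trained to predict.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = ["cut", "the", "steak", "into", "pieces"]
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=rng)

# Hypothetical model scores for one masked position over a 3-word vocabulary.
probs = softmax([2.0, 0.5, -1.0])
```

Training would then push the probability at each masked position toward the original token stored in `targets`.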

Supervised Learning

  • Expensive to get labeled data
  • Existing models focus on short-term events in video data

Unsupervised Learning

  • Learns from unlabeled data
  • Typical approaches use latent variables (e.g., GAN, VAE)
    • These differ from BERT, which uses no explicit latent variables

Self-supervised Learning

Self-supervised example

More on self-supervised learning

Cross-Modal Learning

  • Synchronized audio and visual signals allow them to supervise each other
  • Uses ASR as a source of cross-modal supervision
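One way to picture cross-modal supervision is to pair each stretch of transcribed speech with the video that plays at the same time. The sketch below does this by timestamp overlap. It assumes ASR output arrives as timed segments and that video has already been reduced to one discrete token per fixed-length clip; the function `pair_by_time`, the 1.5-second clip length, and the token ids are all hypothetical.

```python
def pair_by_time(asr_segments, clip_tokens, clip_seconds):
    """Attach to each ASR segment the visual tokens of the clips that overlap it.

    asr_segments: list of (start_sec, end_sec, text)
    clip_tokens:  one discrete visual token per fixed-length clip, in order
    clip_seconds: duration of each clip in seconds
    """
    pairs = []
    for start, end, text in asr_segments:
        overlapping = [
            tok
            for i, tok in enumerate(clip_tokens)
            # Clip i covers [i * clip_seconds, (i + 1) * clip_seconds).
            if i * clip_seconds < end and (i + 1) * clip_seconds > start
        ]
        pairs.append((text, overlapping))
    return pairs

asr_segments = [
    (0.0, 3.0, "season the steak"),
    (3.0, 6.0, "place it in the pan"),
]
clip_tokens = [17, 17, 42, 8]  # made-up token ids, one per 1.5 s clip

pairs = pair_by_time(asr_segments, clip_tokens, clip_seconds=1.5)
print(pairs)
# [('season the steak', [17, 17]), ('place it in the pan', [42, 8])]
```

Each resulting (text, visual tokens) pair gives a model speech and video describing the same moment, without any manual labels.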

Instructional Video Datasets

  • Prior papers used language models (LMs) to analyze these videos, relying on manually provided data
  • Datasets were too small


Omitted the rest

You get the principles I’m getting at, though, right?