Text to Video Retrieval
46 papers with code • 3 benchmarks • 6 datasets
Given a natural language query, find the most relevant video from a large set of candidate videos.
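The task is typically cast as ranking candidate videos by similarity to the query in a shared embedding space. Below is a minimal sketch, assuming the text query and the candidate videos have already been encoded by some dual-encoder model; the function names and the toy data are illustrative only.

```python
# Rank candidate videos for one text query by cosine similarity, then check
# Recall@K. Embeddings are assumed to come from a pretrained dual encoder.
import numpy as np

def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return candidate-video indices sorted from most to least similar."""
    # L2-normalise so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb)
    video_embs = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = video_embs @ text_emb            # one score per candidate video
    return np.argsort(-sims)                # descending similarity

def recall_at_k(ranking: np.ndarray, gt_index: int, k: int) -> float:
    """1.0 if the ground-truth video appears in the top-k results, else 0.0."""
    return float(gt_index in ranking[:k])

# Toy usage: one query against 100 candidate videos with 512-d embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
candidates = rng.normal(size=(100, 512))
ranking = rank_videos(query, candidates)
print(ranking[:5], recall_at_k(ranking, gt_index=int(ranking[0]), k=5))
```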
Most implemented papers
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
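CLIP4Clip studies several ways of aggregating frame-level CLIP features into a video-level score; the sketch below shows only its simplest, parameter-free variant (mean-pooling frame embeddings), using a Hugging Face CLIP checkpoint as a stand-in. Frame sampling from the video is assumed to happen upstream.

```python
# Mean-pooling variant in the spirit of CLIP4Clip: encode sampled frames with
# CLIP's image encoder, average them into one video embedding, and score it
# against the CLIP text embedding. Illustrative, not the paper's exact code.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def video_embedding(frames):  # frames: list of PIL images sampled from a clip
    inputs = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**inputs)                # (num_frames, dim)
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    return frame_feats.mean(dim=0)                                  # parameter-free pooling

@torch.no_grad()
def text_embedding(query: str):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats[0] / feats[0].norm()

def similarity(query: str, frames) -> float:
    v = video_embedding(frames)
    return float((v / v.norm()) @ text_embedding(query))
```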
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
We present a new state of the art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
Bridging Video-text Retrieval with Multiple Choice Questions
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks, including video ones such as action recognition and text-to-video retrieval.
Revealing Single Frame Bias for Video-and-Language Learning
Training an effective video-and-language model intuitively requires multiple frames as model inputs.