Video Question Answering
153 papers with code • 20 benchmarks • 32 datasets
Video Question Answering (VideoQA) is the task of answering natural language questions about a video: given a video clip and a question, the model must produce an answer grounded in the video's content.
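To make the input-output contract concrete, here is a minimal sketch that uniformly samples frames from a clip and passes them, together with the question, to a VideoQA model. The `VideoQAModel` protocol and its `answer` method are hypothetical placeholders standing in for any of the systems listed below, not a specific library API.

```python
# Minimal sketch of the VideoQA input-output contract.
# `VideoQAModel` is a hypothetical placeholder, not a real library API.
from typing import List, Protocol

import cv2  # OpenCV, used only for frame extraction
import numpy as np


class VideoQAModel(Protocol):
    def answer(self, frames: List[np.ndarray], question: str) -> str:
        """Return a natural-language answer grounded in the frames."""
        ...


def sample_frames(video_path: str, num_frames: int = 8) -> List[np.ndarray]:
    """Uniformly sample RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def video_qa(model: VideoQAModel, video_path: str, question: str) -> str:
    frames = sample_frames(video_path)
    return model.answer(frames, question)
```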
Most implemented papers
Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
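The "divided" space-time attention described in this paper alternates temporal attention (each patch attends across frames at the same spatial location) with spatial attention (each patch attends to the other patches in its frame). The sketch below is a simplified PyTorch rendering of that idea under assumed tensor shapes; it is not the authors' released code.

```python
# Simplified sketch of divided space-time attention (TimeSformer-style).
# Not the official implementation; shapes and module names are illustrative.
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = self.norm1(xt)
        attn_t, _ = self.temporal_attn(xt, xt, xt)
        x = x + attn_t.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame's patches attend to one another.
        xs = x.reshape(b * t, p, d)
        xs = self.norm2(xs)
        attn_s, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn_s.reshape(b, t, p, d)
        return x
```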
Visual Instruction Tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field.
Flamingo: a Visual Language Model for Few-Shot Learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Our work, for the first time, uncovers that properly aligning the visual features with an advanced large language model can yield numerous advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts.
TVQA: Localized, Compositional Video Question Answering
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks.
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
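As a reference point for the CLIP-style objective these approaches build on, the sketch below computes a symmetric InfoNCE loss over a batch of paired video and text embeddings. It illustrates only the common-latent-space idea; the paper's expectation-maximization step is omitted.

```python
# Sketch of a symmetric CLIP-style contrastive loss for video-text pairs.
# Illustrates the shared latent space only; the paper's EM step is not shown.
import torch
import torch.nn.functional as F


def video_text_contrastive_loss(
    video_emb: torch.Tensor,  # (batch, dim) pooled video features
    text_emb: torch.Tensor,   # (batch, dim) pooled text features
    temperature: float = 0.07,
) -> torch.Tensor:
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits; matched pairs lie on the diagonal.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```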
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction over pre-defined video-text pairs.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
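The unification idea here boils down to mapping visual-encoder tokens into the LLM's embedding space and prepending them to the text embeddings. The sketch below is a generic rendering of that pattern with made-up dimensions and module names; it is not the Video-LLaVA codebase.

```python
# Generic sketch of projecting visual tokens into an LLM's embedding space.
# Dimensions and module names are illustrative, not Video-LLaVA's actual code.
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Maps visual-encoder tokens to the LLM hidden size."""

    def __init__(self, visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, visual_dim)
        return self.proj(visual_tokens)


def build_multimodal_input(
    visual_tokens: torch.Tensor,    # (batch, n_vis, visual_dim)
    text_embeddings: torch.Tensor,  # (batch, n_txt, llm_dim) from the LLM's embedding table
    projector: VisualProjector,
) -> torch.Tensor:
    """Prepend projected visual tokens to the text embeddings fed to the LLM."""
    projected = projector(visual_tokens)  # (batch, n_vis, llm_dim)
    return torch.cat([projected, text_embeddings], dim=1)
```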
Exploring Models and Data for Image Question Answering
A suite of baseline results on this new dataset is also presented.