Zero-Shot Video Question Answer

34 papers with code • 12 benchmarks • 11 datasets

This task presents zero-shot question-answering results on the TGIF-QA dataset for LLM-powered video conversational models.
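
Because the evaluated models answer in free-form text, zero-shot accuracy is typically computed by matching each prediction against the ground-truth answer, often with an LLM-assisted judge as popularized by Video-ChatGPT. The sketch below shows the shape of that loop; `video_llm_answer` and `llm_judge` are hypothetical placeholders for the model under test and the judge of choice.

```python
def normalize(text: str) -> str:
    # lowercase, trim, drop a trailing period, collapse whitespace
    return " ".join(text.lower().strip().rstrip(".").split())

def evaluate(qa_pairs, video_llm_answer, llm_judge=None):
    correct = 0
    for video_path, question, gt_answer in qa_pairs:
        pred = video_llm_answer(video_path, question)
        if llm_judge is not None:
            ok = llm_judge(question, gt_answer, pred)  # LLM-assisted matching
        else:
            ok = normalize(gt_answer) in normalize(pred)  # naive fallback
        correct += int(ok)
    return 100.0 * correct / len(qa_pairs)
```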

Most implemented papers

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo DeepMind 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
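
Flamingo's core mechanism is conditioning a frozen language model on visual features through tanh-gated cross-attention layers whose gates start at zero, so the model begins as the unmodified LM. A toy PyTorch sketch of that gating idea (dimensions and module layout are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Toy version of Flamingo's tanh-gated cross-attention: text tokens
    attend to visual tokens, and a zero-initialized gate means the frozen
    LM is unchanged at initialization."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended

# usage: text (B, T, D) attends to visual features (B, V, D)
layer = GatedCrossAttention()
out = layer(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```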

Mistral 7B

mistralai/mistral-src 10 Oct 2023

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.
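
The efficiency gains cited in the paper come from grouped-query attention and sliding-window attention. The toy mask below illustrates the sliding-window constraint, under which each token attends to at most a fixed number of preceding tokens:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal attention mask where each token may attend (True) to itself
    and at most `window - 1` predecessors. Toy illustration of the
    sliding-window attention described in the Mistral 7B paper."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# Row i is True exactly at positions i-2..i: per-token cost is O(window).
```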

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

PKU-YuanGroup/Video-LLaVA 16 Nov 2023

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
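
The "alignment before projection" idea is that the image and video encoders already share one feature space, so a single projector can map both modalities into the LLM's token embedding space. A toy sketch with illustrative dimensions (not the repository's actual module):

```python
import torch
import torch.nn as nn

class SharedProjector(nn.Module):
    """Toy sketch of a shared vision-to-LLM projector: because image and
    video features are pre-aligned, one MLP serves both modalities."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # works for image tokens (B, N, D) and flattened video tokens (B, T*N, D)
        return self.mlp(features)

proj = SharedProjector()
image_tokens = proj(torch.randn(1, 256, 1024))      # one image
video_tokens = proj(torch.randn(1, 8 * 256, 1024))  # eight frames
```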

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

zrrskywalker/llama-adapter 28 Apr 2023

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
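
One ingredient of LLaMA-Adapter V2's parameter efficiency is learning only small tensors such as per-layer scales and biases while the pretrained weights stay frozen. A toy sketch of that idea (not the repository's exact module):

```python
import torch
import torch.nn as nn

class BiasScaleLinear(nn.Module):
    """Toy parameter-efficient wrapper: the pretrained linear layer is
    frozen, and only a new per-output scale and bias are trained."""
    def __init__(self, frozen: nn.Linear):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.scale = nn.Parameter(torch.ones(frozen.out_features))
        self.bias = nn.Parameter(torch.zeros(frozen.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.frozen(x) * self.scale + self.bias

layer = BiasScaleLinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # 1024
```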

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

ahjeongseo/MASN-pytorch CVPR 2017

In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.
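
TGIF-QA defines four spatio-temporal task types: repetition count, repeating action, state transition, and frame QA. One possible record layout for such examples, with hypothetical field names:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TGIFQAExample:
    """Illustrative record layout; the four task types come from the paper.
    Count and frame QA are open-ended; action and transition are
    multiple-choice."""
    gif_name: str
    task: str                             # "count" | "action" | "transition" | "frameqa"
    question: str
    answer: str
    options: Optional[List[str]] = None   # choices for action/transition
```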

MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks

wuyuejinxia/prcv2019-mvb-renet 26 Jul 2019

Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, so that the 3D information of the baggage surface is captured as completely as possible.

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

antoyang/FrozenBiLM 16 Jun 2022

Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
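
FrozenBiLM sidesteps annotation by framing video QA as mask filling with a frozen bidirectional language model: candidate answers are ranked by the model's logits at a [MASK] slot. A text-only toy using a generic masked LM (FrozenBiLM additionally prepends projected video features and trains lightweight adapters, both omitted here):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

question = f"Question: what animal is shown? Answer: {tok.mask_token}."
inputs = tok(question, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = mlm(**inputs).logits[0, mask_pos]

# candidates must be single tokens in this toy version
candidates = ["cat", "dog", "car"]
scores = {a: logits[tok.convert_tokens_to_ids(a)].item() for a in candidates}
print(max(scores, key=scores.get))
```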

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

pku-yuangroup/chat-univi 14 Nov 2023

Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
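
Chat-UniVi's unified representation rests on merging visual tokens into a variable-length set so that images and videos share one format. The greedy loop below is a toy stand-in for that idea (the paper uses a parameter-free clustering scheme, not this loop):

```python
import torch

def merge_most_similar(tokens: torch.Tensor, target_len: int) -> torch.Tensor:
    """Toy dynamic-token sketch: repeatedly average the two most similar
    tokens until only `target_len` remain, so simple content costs fewer
    tokens."""
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        x = torch.nn.functional.normalize(tokens, dim=-1)
        sim = x @ x.T
        sim.fill_diagonal_(float("-inf"))
        i, j = divmod(sim.argmax().item(), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)])
    return tokens

out = merge_most_similar(torch.randn(16, 64), target_len=4)  # (4, 64)
```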

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

dvlab-research/llama-vid 28 Nov 2023

Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens.
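
LLaMA-VID tackles that burden by compressing each frame to two tokens: a context token that pools frame features by their relevance to the text query, and a content token that summarizes the frame itself. A toy sketch with illustrative shapes (the real model uses learned projections):

```python
import torch

def two_tokens_per_frame(frame_feats: torch.Tensor, text_query: torch.Tensor):
    """Toy per-frame compression: `frame_feats` is (N, D) patch features
    for one frame, `text_query` is a (D,) query embedding."""
    weights = torch.softmax(frame_feats @ text_query, dim=0)      # (N,)
    context_token = (weights.unsqueeze(-1) * frame_feats).sum(0)  # query-guided pool
    content_token = frame_feats.mean(0)                           # plain summary
    return context_token, content_token

ctx, cnt = two_tokens_per_frame(torch.randn(256, 512), torch.randn(512))
```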

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo2 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.