Zero-Shot Video Question Answer
34 papers with code • 12 benchmarks • 11 datasets
This task presents zero-shot question answering results on the TGIF-QA dataset for LLM-powered video conversational models.
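For orientation, the sketch below shows the typical shape of such an evaluation loop: uniformly sample a few frames from each clip, ask the model the question zero-shot, and compare its free-form answer against the ground truth. The `model.answer(frames, question)` interface is a hypothetical placeholder standing in for any video conversational model, and the exact-match comparison is a simplification; published benchmarks usually score open-ended answers with an LLM judge.

```python
import cv2  # OpenCV, used here only to decode and sample video frames


def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def evaluate_zero_shot_qa(model, samples, num_frames=8):
    """Toy zero-shot video QA evaluation loop.

    `model` is any object exposing `answer(frames, question) -> str`
    (a hypothetical interface, not a real library API); `samples` yields
    dicts with "video", "question", and "answer" keys.
    """
    correct = total = 0
    for sample in samples:
        frames = sample_frames(sample["video"], num_frames)
        prediction = model.answer(frames, sample["question"])
        # Naive exact match; real benchmarks typically rate semantic
        # correctness of open-ended answers with an LLM-based judge.
        correct += int(prediction.strip().lower() == sample["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)
```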
Most implemented papers
Flamingo: a Visual Language Model for Few-Shot Learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.
Mistral 7B
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency.
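Mistral 7B frequently serves as the language backbone of video conversational models. Below is a minimal sketch of loading the public `mistralai/Mistral-7B-v0.1` checkpoint with Hugging Face transformers; it assumes `torch`, `transformers`, and `accelerate` are installed and that a GPU with enough memory for half-precision weights is available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single modern GPU
    device_map="auto",          # automatic device placement via `accelerate`
)

prompt = "Question: What is the dog doing in the video? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```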
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
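The sketch below illustrates the projection step of that idea: visual features that have already been aligned across images and videos by a shared encoder are mapped into the LLM's token-embedding space by a small MLP. This is an illustrative module with placeholder dimensions, not Video-LLaVA's actual configuration.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP mapping pre-aligned visual features into the LLM
    embedding space. Dimensions are assumptions for illustration only."""

    def __init__(self, visual_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):      # (batch, num_tokens, visual_dim)
        return self.proj(visual_tokens)    # (batch, num_tokens, llm_dim)


# Example: 8 video frames x 256 patch tokens projected into the LLM space.
features = torch.randn(1, 8 * 256, 1024)
llm_inputs = VisualProjector()(features)
print(llm_inputs.shape)  # torch.Size([1, 2048, 4096])
```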
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways.
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Second, all baggage images are captured by a specially designed multi-view camera system to handle pose variation and occlusion, in order to capture the 3D information of the baggage surface as completely as possible.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability.
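The core idea is to cast video QA as fill-in-the-mask prediction with a frozen bidirectional language model. The snippet below is a drastically simplified, text-only illustration of that answer-scoring idea using a stock masked LM from transformers; the actual method additionally injects video features into the frozen LM through small trained adapter modules and uses a different backbone.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()


def score_answers(question, candidates):
    """Rank candidate answers by the masked-LM logit of filling the [MASK]
    slot in a QA template (text-only simplification of the approach)."""
    template = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {}
    for answer in candidates:
        ids = tokenizer(answer, add_special_tokens=False).input_ids
        # Single-token answers only in this sketch; multi-token answers
        # would need multiple mask slots or iterative decoding.
        scores[answer] = logits[ids[0]].item()
    return sorted(scores.items(), key=lambda kv: -kv[1])


print(score_answers("What animal is chasing the ball?", ["dog", "cat", "car"]))
```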
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens.
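The sketch below loosely illustrates the "two tokens per frame" idea: each frame is reduced to a query-conditioned context token plus a pooled content token, so long videos stay within the LLM's context budget. Shapes and pooling choices here are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def two_tokens_per_frame(frame_feats, text_query):
    """Collapse one frame into two tokens: a query-conditioned "context"
    token and an average-pooled "content" token (illustrative only).

    frame_feats: (num_patches, dim) visual features for a single frame
    text_query:  (num_text_tokens, dim) embedded instruction/question
    """
    # Context token: attend over the frame's patches with the text query,
    # then average the attended results into a single vector.
    attn = F.softmax(text_query @ frame_feats.T / frame_feats.shape[-1] ** 0.5, dim=-1)
    context_token = (attn @ frame_feats).mean(dim=0)     # (dim,)
    # Content token: plain average pooling of the patch features.
    content_token = frame_feats.mean(dim=0)              # (dim,)
    return torch.stack([context_token, content_token])   # (2, dim)


# A 1-minute clip at 1 fps with 256 patches per frame shrinks from
# 60 * 256 = 15360 visual tokens to 60 * 2 = 120.
frame = torch.randn(256, 1024)
query = torch.randn(16, 1024)
print(two_tokens_per_frame(frame, query).shape)  # torch.Size([2, 1024])
```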
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.