Video-based Generative Performance Benchmarking (Contextual Understanding)
11 papers with code • 1 benchmark • 1 dataset
The benchmark evaluates a generative Video Conversational Model with respect to Contextual Understanding.
We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs obtained from human annotations. We develop an evaluation pipeline using the GPT-3.5 model that assigns each generated prediction a relative score on a scale of 1-5.
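As an illustration of how such a pipeline can be wired up, the sketch below queries GPT-3.5 to rate a predicted answer against the human-annotated reference on the 1-5 scale. It assumes the OpenAI Python SDK; the prompt wording, JSON output format, and helper name are illustrative, not the benchmark's exact implementation.

```python
# Minimal sketch of a GPT-assisted scoring pipeline (illustrative only; the
# benchmark's exact prompt and parsing logic may differ).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_contextual_understanding(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-3.5 to rate how well a predicted answer matches the reference
    answer in terms of contextual understanding, on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You evaluate the contextual understanding of video-based "
                    "question-answer pairs. Compare the predicted answer with the "
                    "correct answer and rate it on a scale of 1-5. Reply with JSON "
                    'such as {"score": 4}.'
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Correct Answer: {reference}\n"
                    f"Predicted Answer: {prediction}"
                ),
            },
        ],
    )
    # Parse the score from the model's JSON reply (no retry logic in this sketch).
    return int(json.loads(response.choices[0].message.content)["score"])


# Example usage with a hypothetical QA pair:
# score = score_contextual_understanding(
#     "What is the overall setting of the video?",
#     "An outdoor basketball court during a casual pickup game.",
#     "People playing basketball outside on a court.",
# )
```

In practice, per-sample scores like this are averaged over the test set to produce the benchmark's final contextual-understanding score.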
Most implemented papers
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
PLLay: Efficient Topological Layer based on Persistence Landscapes
We propose PLLay, a novel topological layer for general deep learning models based on persistence landscapes, in which we can efficiently exploit the underlying topological features of the input data structure.
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
VideoChat: Chat-Centric Video Understanding
In this paper, we initiate an attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video.
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Recently, video understanding systems built by integrating video foundation models and large language models have emerged to overcome the limitations of specific pre-defined vision tasks.
One For All: Video Conversation is Feasible Without Video Instruction Tuning
Without bells and whistles, BT-Adapter achieves state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details.