Video-based Generative Performance Benchmarking (Consistency)
10 papers with code • 1 benchmark • 1 dataset
The benchmark evaluates a generative Video Conversational Model with respect to Consistency.
We curate a test set based on the ActivityNet-200 dataset, featuring videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. Consistency is assessed by posing pairs of differently phrased questions that probe the same video content and checking whether the model's answers agree. We develop an evaluation pipeline that uses the GPT-3.5 model to assign a relative score of 1 to 5 to each generated prediction.
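A minimal sketch of a single scoring call in such a pipeline is shown below, using the OpenAI Python SDK. Only the use of GPT-3.5 and the 1-5 scale come from the benchmark description; the prompt wording, the `score_prediction` helper, and its field names are illustrative assumptions, not the official evaluation prompt.

```python
# Hedged sketch of one GPT-3.5 scoring call for the consistency benchmark.
# Assumptions: prompt wording and function/field names are illustrative;
# only the model choice (GPT-3.5) and the 1-5 scale follow the description above.
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_prediction(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-3.5 to rate a generated answer against the human-annotated
    reference on a 1-5 scale, as in the evaluation pipeline described above."""
    prompt = (
        "You are evaluating a video conversational model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Rate how consistent the predicted answer is with the correct answer.\n"
        "Respond with only a single integer from 1 (inconsistent) to 5 (fully consistent)."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

For the consistency dimension, the same scorer would be applied to the model's answers for both phrasings of a question pair, so that disagreement between the two answers lowers the score.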
Most implemented papers
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations.
VideoChat: Chat-Centric Video Understanding
In this paper, we initiate an attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) to understand both the visual and auditory content of videos.
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Conversational agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Recently, integrating video foundation models and large language models to build video understanding systems has emerged as a way to overcome the limitations of specific pre-defined vision tasks.
One For All: Video Conversation is Feasible Without Video Instruction Tuning
Without bells and whistles, BT-Adapter achieves state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
VTimeLLM: Empower LLM to Grasp Video Moments
Large language models (LLMs) have shown remarkable text understanding capabilities, which have since been extended to Video LLMs that handle video data and comprehend fine-grained visual details.
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.