TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Question Answering	ActivityNet-QA	LLaMA Adapter V2	Accuracy	34.2	# 28
Video Question Answering	ActivityNet-QA	LLaMA Adapter V2	Confidence score	2.7	# 8
Zero-Shot Video Question Answer	ActivityNet-QA	LLaMA Adapter	Confidence Score	2.7	# 13
Zero-Shot Video Question Answer	ActivityNet-QA	LLaMA Adapter	Accuracy	34.2	# 14
Visual Question Answering (VQA)	InfiMM-Eval	LLaMA-Adapter V2	Overall score	30.46	# 6
Visual Question Answering (VQA)	InfiMM-Eval	LLaMA-Adapter V2	Deductive	28.7	# 7
Visual Question Answering (VQA)	InfiMM-Eval	LLaMA-Adapter V2	Abductive	46.12	# 5
Visual Question Answering (VQA)	InfiMM-Eval	LLaMA-Adapter V2	Analogical	22.08	# 5
Visual Question Answering (VQA)	InfiMM-Eval	LLaMA-Adapter V2	Params	7B	# 1
Visual Question Answering	MM-Vet	LLaMA-Adapter v2-7B	GPT-4 score	31.4±0.1	# 70
Visual Question Answering	MM-Vet	LLaMA-Adapter v2-7B	Params	7B	# 1
Zero-Shot Video Question Answer	MSRVTT-QA	LLaMA Adapter-7B	Accuracy	43.8	# 18
Zero-Shot Video Question Answer	MSRVTT-QA	LLaMA Adapter-7B	Confidence Score	2.7	# 15
Zero-Shot Video Question Answer	MSVD-QA	LLaMA Adapter-7B	Accuracy	54.9	# 14
Zero-Shot Video Question Answer	MSVD-QA	LLaMA Adapter-7B	Confidence Score	3.1	# 12
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	LLaMA Adapter	gpt-score	2.03	# 10
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	Correctness of Information	2.03	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	Detail Orientation	2.32	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	Contextual Understanding	2.30	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	Temporal Understanding	1.98	# 12
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	Consistency	2.15	# 14
Video-based Generative Performance Benchmarking	VideoInstruct	LLaMA Adapter	mean	2.16	# 14
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	LLaMA Adapter	gpt-score	1.98	# 8
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	LLaMA Adapter	gpt-score	2.32	# 10
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	LLaMA Adapter	gpt-score	2.30	# 10
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	LLaMA Adapter	gpt-score	2.15	# 10
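
The `mean` row for VideoInstruct can be checked against the five sub-scores listed in the table. The short sketch below assumes the mean is a plain unweighted average of those five values (the page does not state how it is computed); the variable names are illustrative only.

```python
# VideoInstruct sub-scores copied from the table above.
scores = {
    "Correctness of Information": 2.03,
    "Detail Orientation": 2.32,
    "Contextual Understanding": 2.30,
    "Temporal Understanding": 1.98,
    "Consistency": 2.15,
}

# Assumption: "mean" is the unweighted arithmetic average of the five scores.
mean_score = sum(scores.values()) / len(scores)

# 10.78 / 5 = 2.156, which rounds to the 2.16 reported in the table.
print(f"{mean_score:.3f}")
```

Under that assumption the reported 2.16 matches the average of the sub-scores to two decimal places.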

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/visual-question-answering-vqa-on-core-mm)](https://paperswithcode.com/sota/visual-question-answering-vqa-on-core-mm?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/video-question-answering-on-activitynet-qa)](https://paperswithcode.com/sota/video-question-answering-on-activitynet-qa?p=llama-adapter-v2-parameter-efficient-visual)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/llama-adapter-v2-parameter-efficient-visual/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=llama-adapter-v2-parameter-efficient-visual)`

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

28 Apr 2023 · Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

Efficiently transforming large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distributes the instruction-following ability across the entire LLaMA model rather than only the adapters. Secondly, we propose an early fusion strategy that feeds visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, we introduce a joint training paradigm over image-text pairs and instruction-following data that optimizes disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following, and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can follow open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
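
Two of the ideas in the abstract, unlocking bias and scale parameters and training disjoint parameter groups, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a bias-and-scale form `y = s * (W x + b)` with `s` initialized to 1 and `b` to 0, so the wrapped layer initially behaves exactly like the frozen one. The class `ScaledBiasLinear`, the `PARAM_GROUPS` split, and the group member names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScaledBiasLinear:
    """A frozen weight matrix with a small trainable bias and scale per output."""
    weight: List[List[float]]                          # frozen, shape (out, in)
    bias: List[float] = field(default_factory=list)    # trainable, starts at 0
    scale: List[float] = field(default_factory=list)   # trainable, starts at 1

    def __post_init__(self) -> None:
        out_features = len(self.weight)
        if not self.bias:
            self.bias = [0.0] * out_features
        if not self.scale:
            self.scale = [1.0] * out_features

    def forward(self, x: List[float]) -> List[float]:
        # Assumed form: y_i = s_i * (sum_j W_ij * x_j + b_i)
        return [
            s * (sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b, s in zip(self.weight, self.bias, self.scale)
        ]

    def trainable_count(self) -> int:
        # Only bias and scale are updated; the weight matrix stays frozen.
        return len(self.bias) + len(self.scale)

    def frozen_count(self) -> int:
        return sum(len(row) for row in self.weight)


# Hypothetical split: each data source updates its own, non-overlapping group.
PARAM_GROUPS: Dict[str, List[str]] = {
    "image_text": ["visual_projection", "early_gate"],
    "instruction": ["late_adapter", "norm", "bias", "scale"],
}


def trainable_for(batch_kind: str) -> List[str]:
    """Return the parameter names that a batch of the given kind may update."""
    return PARAM_GROUPS[batch_kind]


layer = ScaledBiasLinear(weight=[[1.0, 2.0], [3.0, 4.0]])
x = [1.0, 1.0]
print(layer.forward(x))            # [3.0, 7.0]: identical to the frozen layer

layer.scale[0] = 2.0               # pretend training moved these values
layer.bias[1] = 0.5
print(layer.forward(x))            # [6.0, 7.5]
print(layer.trainable_count(), layer.frozen_count())  # 4 4
print(trainable_for("image_text"))
```

In this toy 2x2 layer the trainable and frozen counts happen to be equal; for a realistically sized layer the bias and scale vectors grow linearly with the output width while the frozen matrix grows with the product of input and output widths, which is why the added parameters stay small relative to the base model.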

Code

zrrskywalker/llama-adapter official

opengvlab/llama-adapter

5,502

Mind23-2/MindCode-140

Tasks

Instruction Following

Optical Character Recognition (OCR)

Video-based Generative Performance Benchmarking

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Question Answering

Visual Question Answering

Visual Question Answering (VQA)

Zero-Shot Video Question Answer

Datasets

Visual Genome

ScienceQA

DocVQA

MM-Vet

ActivityNet-QA

MSRVTT-QA

MSVD-QA

InfiMM-Eval

VideoInstruct

Results from the Paper

Ranked #6 on Visual Question Answering (VQA) on InfiMM-Eval

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GPT-4 • Label Smoothing • Layer Normalization • Linear Layer • LLaMA • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
