Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Video Generation	MSR-VTT	VideoAssembler (Zero-Shot, 256x256, class-conditional)	Inception Score	15.79	#1
Video Generation	MSR-VTT	VideoAssembler (Zero-Shot, 256x256, class-conditional)	FVD16	252	#1
Video Generation	UCF-101	VideoAssembler (Zero-Shot, 256x256, class-conditional)	Inception Score	48.01	#14
Video Generation	UCF-101	VideoAssembler (Zero-Shot, 256x256, class-conditional)	FVD16	346.84	#17

VideoAssembler: Identity-Consistent Video Generation with Reference Entities using Diffusion Model

29 Nov 2023 · Haoyu Zhao, Tianyi Lu, Jiaxi Gu, Xing Zhang, Zuxuan Wu, Hang Xu, Yu-Gang Jiang

Identity-consistent video generation seeks to synthesize videos that are guided by both textual prompts and reference images of entities. Current approaches typically use cross-attention layers to integrate the appearance of the entity, which predominantly captures semantic attributes and compromises entity fidelity. Moreover, these methods require iterative fine-tuning for each new entity encountered, which limits their applicability. To address these challenges, we introduce VideoAssembler, a novel end-to-end framework for identity-consistent video generation that can run inference directly on new entities. VideoAssembler produces videos that are both flexible with respect to the input reference entities and responsive to textual conditions. Additionally, by varying the number of input images for the entity, VideoAssembler supports tasks ranging from image-to-video generation to sophisticated video editing. VideoAssembler comprises two principal components: the Reference Entity Pyramid (REP) encoder and the Entity-Prompt Attention Fusion (EPAF) module. The REP encoder is designed to infuse comprehensive appearance details into the denoising stages of the Stable Diffusion model. Concurrently, the EPAF module integrates text-aligned features. Furthermore, to mitigate the challenge of scarce data, we present a methodology for preprocessing the training data. Our evaluation of the VideoAssembler framework on the UCF-101, MSR-VTT, and DAVIS datasets indicates that it achieves good performance in both quantitative and qualitative analyses (346.84 in FVD and 48.01 in IS on UCF-101). Our project page is at https://gulucaptain.github.io/videoassembler/.

Code

gulucaptain/videoassembler official

Tasks

Denoising

Image to Video Generation

Video Editing

Video Generation

Datasets

UCF101

MSR-VTT

Results from the Paper

Ranked #1 on Video Generation on MSR-VTT
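
The metrics reported for this paper are the Inception Score (IS, higher is better) and the Fréchet Video Distance (FVD, lower is better). This page does not say how they were computed, so the following is only a minimal sketch of the standard definitions, not the authors' evaluation code. It assumes that per-video class probabilities and feature vectors have already been produced by some pretrained video network; the function names `inception_score` and `frechet_distance` and the array shapes are illustrative assumptions.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp(mean over samples of KL(p(y|x) || p(y))).

    probs: (N, C) array, each row a predicted class distribution for one
    generated video (assumed to come from an external classifier).
    """
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), shape (1, C)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (N, D) arrays of features (assumed to be extracted
    externally, e.g. from real and generated videos).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # trace(sqrt(cov_a @ cov_b)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative in theory;
    # clipping guards against small negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

As a sanity check on the definitions, identical feature sets give a distance of approximately zero, and a classifier that is perfectly confident and evenly spread over C classes gives an IS of approximately C. Published numbers also depend on the feature extractor, clip length, and sample count, so values from this sketch are not directly comparable to the leaderboard.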

Methods

Diffusion
