TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Prediction	Kinetics-600 12 frames, 64x64	W.A.L.T-L	FVD	3.3	# 1
Video Generation	Kinetics-600 12 frames, 64x64	W.A.L.T-L	FVD	3.3±0.0	# 1
Text-to-Video Generation	UCF-101	W.A.L.T 3B	FVD16	258.1	# 4
Video Generation	UCF-101	W.A.L.T 3B (text-conditional)	Inception Score	35.1	# 19
Video Generation	UCF-101	W.A.L.T 3B (text-conditional)	FVD16	258.1	# 10
Video Generation	UCF-101	W.A.L.T-XL (class-conditional)	FVD16	36±2	# 1

Photorealistic Video Generation with Diffusion Models

11 Dec 2023 · Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
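
To make the window attention idea in the abstract more concrete, the following is a minimal NumPy sketch of how a latent video might be split into non-overlapping windows, so that self-attention runs only within each window. It is an illustration under stated assumptions, not the authors' implementation: the function name `window_partition`, the latent shape, and both window sizes are hypothetical and do not come from the paper.

```python
import numpy as np


def window_partition(x, window):
    """Split a latent video of shape (T, H, W, C) into non-overlapping windows.

    Returns an array of shape (num_windows, tokens_per_window, C).
    Self-attention would then be computed independently inside each window.
    """
    t, h, w, c = x.shape
    wt, wh, ww = window
    assert t % wt == 0 and h % wh == 0 and w % ww == 0, "window must tile the input"
    # Factor each axis into (number of windows, size within a window).
    x = x.reshape(t // wt, wt, h // wh, wh, w // ww, ww, c)
    # Move the window-index axes to the front, within-window axes after them.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, c)


# Hypothetical latent size: 4 frames, 8x8 spatial grid, 16 channels.
latents = np.random.default_rng(0).normal(size=(4, 8, 8, 16))

# Spatial windows (assumed size): each window covers one whole frame.
spatial = window_partition(latents, window=(1, 8, 8))

# Spatiotemporal windows (assumed size): each window spans all frames
# over a 4x4 spatial patch.
spatiotemporal = window_partition(latents, window=(4, 4, 4))

print(spatial.shape)         # (4, 64, 16)
print(spatiotemporal.shape)  # (4, 64, 16)
```

In this toy setting both layouts give 4 windows of 64 tokens each. Attention cost therefore scales with the window size rather than with the full 256-token sequence, which is the kind of memory saving the abstract alludes to.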

Tasks

Super-Resolution

Text-to-Video Generation

Video Generation

Video Super-Resolution

Datasets

ImageNet

UCF101

Kinetics

Kinetics-600

Results from the Paper

Ranked #1 on Video Prediction on Kinetics-600 12 frames, 64x64

Methods

BASE • Diffusion
