Photorealistic Video Generation with Diffusion Models

11 Dec 2023 · Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
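The abstract describes the window attention design only at a high level. As a rough illustration, the sketch below shows one way to restrict self-attention to non-overlapping windows over a grid of video latents, alternating frame-local spatial windows with spatiotemporal windows that span all frames. All module names, window sizes, and tensor shapes here are our own assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to non-overlapping windows of latent tokens.

    Hypothetical sketch of the windowed-attention idea from the abstract;
    dimensions, window sizes, and layer layout are illustrative only.
    """

    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window  # (wt, wh, ww): window size in latent tokens
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) latent tokens produced by the causal encoder
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition the latent grid into non-overlapping (wt, wh, ww) windows.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Attend only within each window (pre-norm, residual connection).
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Undo the window partitioning.
        x = x.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return x


def make_blocks(dim, num_heads, T, H, W, st_window=4, n_pairs=2):
    """Alternate spatial windows (one frame, full spatial extent) with
    spatiotemporal windows (all frames, a small spatial patch)."""
    blocks = []
    for _ in range(n_pairs):
        blocks.append(WindowAttentionBlock(dim, num_heads, (1, H, W)))
        blocks.append(WindowAttentionBlock(dim, num_heads, (T, st_window, st_window)))
    return nn.Sequential(*blocks)


# Toy usage with made-up latent shapes: (B, T, H, W, C).
latents = torch.randn(2, 8, 16, 16, 128)
model = make_blocks(dim=128, num_heads=4, T=8, H=16, W=16)
out = model(latents)  # same shape as the input latents
```

Restricting attention to local windows keeps the cost proportional to the number of windows rather than quadratic in the full spatiotemporal token count, which is the memory and training-efficiency motivation the abstract cites.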

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Video Prediction | Kinetics-600 (12 frames, 64×64) | W.A.L.T-L | FVD | 3.3 | #1 |
| Video Generation | Kinetics-600 (12 frames, 64×64) | W.A.L.T-L | FVD | 3.3 ± 0.0 | #1 |
| Text-to-Video Generation | UCF-101 | W.A.L.T 3B | FVD16 | 258.1 | #4 |
| Video Generation | UCF-101 | W.A.L.T 3B (text-conditional) | Inception Score | 35.1 | #19 |
| Video Generation | UCF-101 | W.A.L.T 3B (text-conditional) | FVD16 | 258.1 | #10 |
| Video Generation | UCF-101 | W.A.L.T-XL (class-conditional) | FVD16 | 36 ± 2 | #1 |
