InternVid

Introduced by Wang et al. in InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The dataset contains over 7 million videos totaling nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
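
To give a sense of scale, the headline figures can be turned into rough per-video and per-clip averages. The sketch below is illustrative only: it treats the rounded totals quoted above ("over 7 million", "nearly 760K hours") as exact, and it assumes the clips together cover the full video duration, which the description does not state. The function name `derived_stats` and its keys are hypothetical.

```python
def derived_stats(videos, hours, clips, words):
    """Rough averages derived from the dataset's headline totals."""
    total_seconds = hours * 3600
    return {
        "minutes_per_video": hours * 60 / videos,
        "clips_per_video": clips / videos,
        # Assumes clips tile the full duration (not stated in the source).
        "seconds_per_clip": total_seconds / clips,
        "words_per_clip": words / clips,
    }


# Rounded totals taken from the description above.
stats = derived_stats(
    videos=7_000_000,
    hours=760_000,
    clips=234_000_000,
    words=4_100_000_000,
)

for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the averages come out to roughly 6.5 minutes per video, about 33 clips per video, about 11.7 seconds per clip, and about 17.5 words per description. Because the inputs are rounded, these are order-of-magnitude estimates rather than reported statistics.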
