The Lip Reading in the Wild (LRW) dataset a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, whose boundary is centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class while the validation and test sets contain 50 utterances.
170 PAPERS • 7 BENCHMARKS
The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences in-the-wild. The database consists of mainly news and talk shows from BBC programs. Each sentence is up to 100 characters in length. The training, validation and test sets are divided according to broadcast date. It is a challenging set since it contains thousands of speakers without speaker labels and large variation in head pose. The pre-training set contains 96,318 utterances, the training set contains 45,839 utterances, the validation set contains 1,082 utterances and the test set contains 1,242 utterances.
96 PAPERS • 9 BENCHMARKS
The How2 dataset contains 13,500 videos, or 300 hours of speech, and is split into 185,187 training, 2022 development (dev), and 2361 test utterances. It has subtitles in English and crowdsourced Portuguese translations.
73 PAPERS • 2 BENCHMARKS
LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.
48 PAPERS • 6 BENCHMARKS
LRW-1000 has been renamed as CAS-VSR-W1k.* It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, including 1000 classes with about 718,018 video samples from more than 2000 individual speakers. There are more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word which is composed by one or several Chinese characters. This dataset aims to cover a natural variability over different speech modes and imaging conditions to incorporate challenges encountered in practical applications.
9 PAPERS • 1 BENCHMARK
The MISP2021 challenge dataset is a collection of audio-visual conversational data recorded in a home TV scenario using distant multi-microphones. The dataset captures interactions between several individuals who are engaged in conversations in Chinese while watching TV and interacting with a smart speaker/TV in a living room. The dataset is extensive, comprising 141 hours of audio and video data, which were collected using far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. Notably, this corpus is the first of its kind to offer a distant multimicrophone conversational Chinese audio-visual dataset. Furthermore, it is also the first large vocabulary continuous Chinese lip-reading dataset specifically designed for the adverse home-TV scenario.
3 PAPERS • NO BENCHMARKS YET