Lipreading
30 papers with code • 7 benchmarks • 6 datasets
Lipreading is the process of extracting speech by watching the lip movements of a speaker in the absence of sound. Humans lipread all the time without even noticing. It plays a significant part in communication, albeit not as dominant a part as audio, and it is a very helpful skill to learn, especially for those who are hard of hearing.
Deep Lipreading is the process of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, Automatic Lipreading, etc.
The primary methodology involves two stages: i) extracting visual and temporal features from the sequence of image frames of a silent talking video, and ii) processing the sequence of features into units of speech, e.g. characters, words, or phrases. Implementations of this methodology exist both as two separately trained stages and as models trained end-to-end in one go.
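To make the two stages concrete, here is a minimal sketch of such a pipeline in PyTorch. Every layer choice, name, and size below is an illustrative assumption rather than a reference implementation of any of the papers listed further down: a small 3D-convolutional frontend extracts visual-temporal features from mouth-region crops, and a recurrent backend maps them to per-frame character logits (e.g. for CTC training).

import torch
import torch.nn as nn

class LipreadingModel(nn.Module):
    """Illustrative two-stage lipreading model (all names and sizes are assumptions)."""
    def __init__(self, vocab_size=40, feat_dim=256):
        super().__init__()
        # Stage i): visual and temporal feature extraction from grayscale mouth crops
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        self.proj = nn.Linear(64, feat_dim)
        # Stage ii): sequence modelling into units of speech (characters here)
        self.backend = nn.GRU(feat_dim, feat_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, vocab_size)

    def forward(self, frames):
        # frames: (batch, 1, time, height, width) -- a silent video of the mouth region
        x = self.frontend(frames)                      # (batch, 64, time, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, 64)
        x = self.proj(x)
        x, _ = self.backend(x)
        return self.classifier(x)                      # per-frame character logits

model = LipreadingModel()
logits = model(torch.randn(2, 1, 75, 96, 96))  # 75 frames of 96x96 mouth crops
print(logits.shape)                            # torch.Size([2, 75, 40])

Trained end-to-end, the frontend and backend are optimised jointly; in a two-stage setup the frontend would instead be trained first (for example on word-level labels) and the backend fitted on its frozen features.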
Libraries
Use these libraries to find Lipreading models and implementations

Most implemented papers
LipNet: End-to-End Sentence-level Lipreading
Lipreading is the task of decoding text from the movement of a speaker's mouth.
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner (a hedged sketch of the joint objective follows this list).
End-to-end Audiovisual Speech Recognition
In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
This benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.
Lipreading using Temporal Convolutional Networks
We present results on the largest publicly available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
Discriminative Multi-modality Speech Recognition
Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio-only modality deteriorates significantly.
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Visual Speech Recognition for Multiple Languages in the Wild
However, these advances are usually due to the larger training sets rather than the model design.
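Several of the papers above (for example the Conformer-based model) combine a CTC head with an attention decoder and train both jointly. The snippet below is a hedged sketch of such a joint objective, not the exact formulation of any specific paper; the function name, tensor shapes, and the weighting factor alpha are assumptions for illustration.

import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                              input_lengths, target_lengths, alpha=0.3):
    """Weighted sum of a CTC loss and an attention (cross-entropy) loss.

    ctc_log_probs: (time, batch, vocab) frame-level log-probabilities from the CTC head
    att_logits:    (batch, target_len, vocab) token logits from the attention decoder
    targets:       (batch, target_len) ground-truth token ids (0 reserved for the CTC blank)
    alpha:         interpolation weight between the two losses (an assumed value)
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    ce = F.cross_entropy(att_logits.transpose(1, 2), targets)  # expects (batch, vocab, len)
    return alpha * ctc + (1 - alpha) * ce

# Tiny usage example with random tensors standing in for real model outputs
T, B, V, S = 75, 2, 40, 20
loss = hybrid_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1),
    torch.randn(B, S, V),
    torch.randint(1, V, (B, S)),
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), S, dtype=torch.long),
)

The CTC branch encourages monotonic frame-to-token alignments while the attention decoder models dependencies between output tokens, which is the usual motivation for combining the two objectives.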