Lipreading
30 papers with code • 7 benchmarks • 6 datasets
Lipreading is the process of extracting speech by watching the lip movements of a speaker in the absence of sound. Humans lipread all the time without even noticing. It plays a significant part in communication, albeit not as dominant a part as audio, and it is a very helpful skill to learn, especially for those who are hard of hearing.
Deep Lipreading is the process of extracting speech from a video of a silent talking face using deep neural networks. It is also known by a few other names: Visual Speech Recognition (VSR), Machine Lipreading, Automatic Lipreading, etc.
The primary methodology involves two stages: i) extracting visual and temporal features from the sequence of image frames of a silent talking video, and ii) processing the sequence of features into units of speech, e.g. characters, words, or phrases. Implementations of this methodology exist both as two separately trained stages and as models trained end-to-end in one go.
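To make the two stages concrete, here is a minimal sketch of such a pipeline in PyTorch. Every layer choice, name, and size below is an illustrative assumption rather than a reference implementation of any of the papers listed further down: a small 3D-convolutional frontend extracts visual-temporal features from mouth-region crops, and a recurrent backend maps them to per-frame character logits (e.g. for CTC training).

import torch
import torch.nn as nn

class LipreadingModel(nn.Module):
    """Illustrative two-stage lipreading model (all names and sizes are assumptions)."""
    def __init__(self, vocab_size=40, feat_dim=256):
        super().__init__()
        # Stage i): visual and temporal feature extraction from grayscale mouth crops
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        self.proj = nn.Linear(64, feat_dim)
        # Stage ii): sequence modelling into units of speech (characters here)
        self.backend = nn.GRU(feat_dim, feat_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, vocab_size)

    def forward(self, frames):
        # frames: (batch, 1, time, height, width) -- a silent video of the mouth region
        x = self.frontend(frames)                      # (batch, 64, time, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, time, 64)
        x = self.proj(x)
        x, _ = self.backend(x)
        return self.classifier(x)                      # per-frame character logits

model = LipreadingModel()
logits = model(torch.randn(2, 1, 75, 96, 96))  # 75 frames of 96x96 mouth crops
print(logits.shape)                            # torch.Size([2, 75, 40])

Trained end-to-end, the frontend and backend are optimised jointly; in a two-stage setup the frontend would instead be trained first (for example on word-level labels) and the backend fitted on its frozen features.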
Libraries
Use these libraries to find Lipreading models and implementations

Most implemented papers
LipNet: End-to-End Sentence-level Lipreading
Lipreading is the task of decoding text from the movement of a speaker's mouth.
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner (a hedged sketch of the joint objective follows this list).
End-to-end Audiovisual Speech Recognition
In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
This benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.
Lipreading using Temporal Convolutional Networks
We present results on the largest publicly available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
Discriminative Multi-modality Speech Recognition
Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio-only modality deteriorates significantly.
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Visual Speech Recognition for Multiple Languages in the Wild
However, these advances are usually due to the larger training sets rather than the model design.
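Several of the papers above (for example the Conformer-based model) combine a CTC head with an attention decoder and train both jointly. The snippet below is a hedged sketch of such a joint objective, not the exact formulation of any specific paper; the function name, tensor shapes, and the weighting factor alpha are assumptions for illustration.

import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                              input_lengths, target_lengths, alpha=0.3):
    """Weighted sum of a CTC loss and an attention (cross-entropy) loss.

    ctc_log_probs: (time, batch, vocab) frame-level log-probabilities from the CTC head
    att_logits:    (batch, target_len, vocab) token logits from the attention decoder
    targets:       (batch, target_len) ground-truth token ids (0 reserved for the CTC blank)
    alpha:         interpolation weight between the two losses (an assumed value)
    """
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    ce = F.cross_entropy(att_logits.transpose(1, 2), targets)  # expects (batch, vocab, len)
    return alpha * ctc + (1 - alpha) * ce

# Tiny usage example with random tensors standing in for real model outputs
T, B, V, S = 75, 2, 40, 20
loss = hybrid_ctc_attention_loss(
    torch.randn(T, B, V).log_softmax(-1),
    torch.randn(B, S, V),
    torch.randint(1, V, (B, S)),
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), S, dtype=torch.long),
)

The CTC branch encourages monotonic frame-to-token alignments while the attention decoder models dependencies between output tokens, which is the usual motivation for combining the two objectives.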