Audio-Visual Active Speaker Detection
10 papers with code • 2 benchmarks • 3 datasets
Determine if and when each visible person in the video is speaking.
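As a rough framing, the task can be read as per-frame binary classification over detected face tracks, conditioned on the synchronized audio. Below is a minimal, illustrative sketch of that input/output contract in PyTorch; the encoder choices, names, and tensor shapes are assumptions for illustration, not any specific paper's architecture.

```python
import torch
import torch.nn as nn

class ActiveSpeakerDetector(nn.Module):
    """Scores each face crop in a track as speaking / not speaking.

    Toy sketch: real systems use 2D/3D CNN face encoders and
    log-mel spectrogram audio encoders, not flatten + linear.
    """

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.face_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.audio_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.classifier = nn.Linear(2 * embed_dim, 1)

    def forward(self, faces: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # faces: (T, C, H, W) face crops; audio: (T, F, W) aligned audio frames.
        v = self.face_encoder(faces)    # (T, D) visual embeddings
        a = self.audio_encoder(audio)   # (T, D) audio embeddings
        # One speaking/not-speaking logit per frame of the face track.
        return self.classifier(torch.cat([v, a], dim=-1)).squeeze(-1)  # (T,)
```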
Most implemented papers
Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows.
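One plausible way to materialize such a graph is sketched below: nodes are face detections, spatial edges connect faces that share a frame, and temporal edges connect the same face track across a bounded window. This construction is an assumption for illustration, not the paper's exact recipe.

```python
from collections import defaultdict
from itertools import combinations

def build_spatial_temporal_graph(detections, window: int = 5):
    """detections: list of (frame_idx, track_id) node descriptors.

    Returns node indices and undirected edges (i, j) with i < j.
    """
    edges = set()
    by_frame = defaultdict(list)
    by_track = defaultdict(list)
    for i, (frame, track) in enumerate(detections):
        by_frame[frame].append(i)
        by_track[track].append((frame, i))
    # Spatial edges: all pairs of faces visible in the same frame.
    for idxs in by_frame.values():
        edges.update(combinations(idxs, 2))
    # Temporal edges: same face track within `window` frames.
    for items in by_track.values():
        items.sort()
        for (f1, i), (f2, j) in combinations(items, 2):
            if f2 - f1 <= window:
                edges.add((i, j))
    return list(range(len(detections))), sorted(edges)
```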
AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible.
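The released annotations are CSV rows pairing a face bounding box at a timestamp with a speaking label. A small loader sketch follows; the column order (video id, timestamp, box corners, label, face-track id) reflects the published schema as best I recall it, so treat it as an assumption and verify against your copy of the dataset.

```python
import csv
from collections import defaultdict

def load_face_tracks(csv_path: str) -> dict:
    """Group labeled face boxes by (video_id, entity_id) face track."""
    tracks = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, ts, x1, y1, x2, y2, label, entity_id = row[:8]
            tracks[(video_id, entity_id)].append({
                "t": float(ts),
                "box": (float(x1), float(y1), float(x2), float(y2)),
                # Labels distinguish speaking-and-audible from
                # speaking-but-inaudible and not-speaking.
                "speaking": label.startswith("SPEAKING"),
                "audible": label == "SPEAKING_AUDIBLE",
            })
    return tracks
```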
Active Speakers in Context
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
MAAS: Multi-modal Assignation for Active Speaker Detection
Active speaker detection requires a solid integration of multi-modal cues.
NUS-HLT Report for ActivityNet Challenge 2021 AVA (Speaker)
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers.
How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild
Successful active speaker detection requires a three-stage pipeline: (i) audio-visual encoding for all speakers in the clip, (ii) inter-speaker relation modeling between a reference speaker and the background speakers within each frame, and (iii) temporal modeling for the reference speaker.
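A hedged sketch of how such a three-stage layout can be wired up; the specific modules here (a linear encoder, one attention layer, a GRU) are illustrative stand-ins, not the paper's components.

```python
import torch
import torch.nn as nn

class ThreeStageASD(nn.Module):
    """Illustrative three-stage wiring: encode, relate, track."""

    def __init__(self, d: int = 128):
        super().__init__()
        # Stage (i): per-speaker audio-visual encoding (stand-in).
        self.av_encoder = nn.LazyLinear(d)
        # Stage (ii): inter-speaker relation modeling within each frame,
        # with the reference speaker attending to all speakers.
        self.inter_speaker = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Stage (iii): temporal modeling over the reference speaker's track.
        self.temporal = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, av_feats: torch.Tensor) -> torch.Tensor:
        # av_feats: (T, S, F) features for S speakers over T frames,
        # with index 0 holding the reference speaker.
        x = self.av_encoder(av_feats)                 # (T, S, d)
        ref = x[:, :1]                                # (T, 1, d) reference query
        ctx, _ = self.inter_speaker(ref, x, x)        # (T, 1, d) per-frame context
        out, _ = self.temporal(ctx.transpose(0, 1))   # (1, T, d) across time
        return self.head(out)[0, :, 0]                # (T,) speaking logits
```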
End-to-End Active Speaker Detection
Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation.
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Active speaker detection in videos aims to associate a source face, visible in the video frames, with the underlying speech in the audio modality.
A Light Weight Model for Active Speaker Detection
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than the state-of-the-art method, especially in model parameters (1.0M vs. 22.5M, about 23x fewer) and FLOPs (0.6G vs. 2.6G, about 4x fewer).
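When reproducing model-size comparisons like this, trainable parameters can be counted directly in PyTorch (FLOPs require a profiler such as fvcore or ptflops); a minimal helper:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    """Number of trainable parameters, as reported in such comparisons."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_params(nn.Linear(128, 1)) == 129  (128 weights + 1 bias)
```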