Audio Classification
131 papers with code • 20 benchmarks • 34 datasets
Audio Classification is a machine learning task in which audio signals are assigned to one of several classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds.
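A minimal sketch of the task in practice, using the Hugging Face `transformers` audio-classification pipeline. The checkpoint name is one example of a publicly available AudioSet-finetuned model, and the file name is hypothetical; any audio-classification checkpoint and clip would work the same way.

```python
from transformers import pipeline

# Example checkpoint; substitute any audio-classification model from the Hub.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# Accepts a path to a local audio file (hypothetical here) and returns the
# top class labels with scores.
predictions = classifier("dog_bark.wav", top_k=3)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```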
Most implemented papers
CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio.
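A minimal sketch of the standard setup this line of work benchmarks at scale: a small image-style CNN applied to log-mel spectrograms (this is an illustrative architecture, not one of the paper's exact models; 527 is the number of AudioSet classes).

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, mel_bins, time_frames)
        h = self.features(x)
        h = h.mean(dim=(2, 3))  # global average pooling over freq/time
        return self.head(h)

logits = SpectrogramCNN(n_classes=527)(torch.randn(4, 1, 64, 96))
print(logits.shape)  # torch.Size([4, 527])
```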
Perceiver: General Perception with Iterative Attention
Perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
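A rough sketch of the Perceiver's core mechanism: a small set of learned latent vectors cross-attends to an arbitrary-length input, so compute scales with the number of latents rather than the input size. The dimensions here are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, n_latents=32, dim=128):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, inputs):  # inputs: (batch, seq_len, dim), any seq_len
        q = self.latents.expand(inputs.size(0), -1, -1)
        out, _ = self.attn(q, inputs, inputs)  # latents query the inputs
        return out  # (batch, n_latents, dim)

# The same module handles 100 or 100,000 input elements without modification.
x = torch.randn(2, 1000, 128)
print(LatentCrossAttention()(x).shape)  # torch.Size([2, 32, 128])
```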
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks.
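A hedged sketch of the transfer recipe described here: freeze a network pretrained on AudioSet tagging and attach a new task-specific head. The backbone below is a toy stand-in, not an actual PANNs model; in practice it would be loaded from the authors' released checkpoints (ESC-50, with 50 classes, is one of the downstream tasks evaluated).

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained PANNs CNN producing a 2048-d embedding.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 2048), nn.ReLU())

for p in backbone.parameters():
    p.requires_grad = False  # freeze pretrained weights (linear-probe transfer)

# New head for a 50-class downstream task such as ESC-50.
model = nn.Sequential(backbone, nn.Linear(2048, 50))
print(model(torch.randn(4, 1, 64, 100)).shape)  # torch.Size([4, 50])
```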
Multi-level Attention Model for Weakly Supervised Audio Classification
The objective of audio classification is to predict the presence or absence of audio events in an audio clip.
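A simplified sketch of the attention-pooling idea behind weakly supervised clip tagging: the model predicts per segment, and learned attention weights decide how much each segment contributes to the clip-level output. This is a single attention level with illustrative sizes, not the paper's full multi-level model.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(dim, n_classes)   # per-segment class probabilities
        self.att = nn.Linear(dim, n_classes)   # per-segment attention logits

    def forward(self, h):                      # h: (batch, segments, dim)
        p = torch.sigmoid(self.cla(h))         # segment-level predictions
        w = torch.softmax(self.att(h), dim=1)  # attention over segments
        return (p * w).sum(dim=1)              # clip-level prediction

clip_probs = AttentionPooling(dim=128, n_classes=10)(torch.randn(4, 20, 128))
print(clip_probs.shape)  # torch.Size([4, 10])
```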
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers.
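A worked sketch of the modification in question: for a scale-invariant weight w (e.g. one followed by batch normalization), project the radial component out of the update so the optimizer does not needlessly grow ||w|| and shrink the effective step size. This shows only the projection, not the full AdamP/SGDP optimizers.

```python
import torch

def project_update(w: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
    w_unit = w / (w.norm() + 1e-12)
    radial = (update * w_unit).sum() * w_unit  # component along w
    return update - radial                     # keep only the tangential part

w = torch.randn(100)
u = torch.randn(100)
u_proj = project_update(w, u)
print(torch.dot(u_proj, w).abs().item())  # ~0: the update no longer grows ||w||
```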
LEAF: A Learnable Frontend for Audio Classification
In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification.
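A loose sketch of the learnable-frontend idea: replace fixed mel filterbanks with learnable 1-D convolutions over the raw waveform, followed by lowpass pooling and a compressive nonlinearity. LEAF itself uses Gabor-parameterized filters and learnable (PCEN) compression; this generic stand-in only illustrates the structure, with illustrative window/hop sizes.

```python
import torch
import torch.nn as nn

class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, win=401, hop=160):
        super().__init__()
        self.filters = nn.Conv1d(1, n_filters, kernel_size=win,
                                 padding=win // 2, bias=False)
        self.pool = nn.AvgPool1d(kernel_size=win, stride=hop, padding=win // 2)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        x = self.filters(wav) ** 2          # energy in each learned band
        x = self.pool(x)                    # lowpass + downsample, like framing
        return torch.log1p(x)               # compressive nonlinearity

feats = LearnableFrontend()(torch.randn(2, 1, 16000))  # 1 s at 16 kHz
print(feats.shape)  # torch.Size([2, 40, 100]): (batch, filters, frames)
```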
ATST: Audio Representation Learning with Teacher-Student Transformer
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers that knowledge to a specific problem with a limited amount of labeled data.
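A minimal sketch of the teacher-student mechanic shared by this family of SSL methods: the student is trained by gradient descent, and the teacher is an exponential moving average (EMA) of the student's weights. The momentum value and tiny model are illustrative.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

student = torch.nn.Linear(16, 16)
teacher = copy.deepcopy(student)  # teacher starts as a copy of the student
# ... after each student optimization step:
ema_update(teacher, student)
```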
Masked Autoencoders that Listen
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
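A sketch of the high-ratio masking step described above: drop a random majority of spectrogram patch tokens and keep only the visible ones for the encoder, which is what makes MAE-style pretraining cheap. The 80% ratio and token sizes are illustrative.

```python
import torch

def random_masking(tokens, mask_ratio=0.8):
    batch, n, dim = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(batch, n)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # random subset per example
    visible = torch.gather(
        tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx  # the encoder sees only `visible`

patches = torch.randn(2, 512, 768)  # e.g. 512 spectrogram patch tokens
visible, idx = random_masking(patches)
print(visible.shape)  # torch.Size([2, 102, 768])
```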
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
We thus propose VIDAL-10M, a dataset pairing Video, Infrared, Depth, and Audio with their corresponding Language.
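A rough sketch of language-based semantic alignment: each non-language modality is pulled toward the text embedding of the same sample with a CLIP-style symmetric InfoNCE loss, so all modalities meet in the language space. The encoders, dimensions, and temperature are placeholders, not LanguageBind's actual configuration.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(modality_emb, text_emb, temperature=0.07):
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature        # (batch, batch) pairwise similarities
    targets = torch.arange(m.size(0))     # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

audio_emb, text_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_style_loss(audio_emb, text_emb).item())
```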
Convolutional RNN: an Enhanced Model for Extracting Features from Sequential Data
Traditional convolutional layers extract features from patches of data by applying a non-linearity on an affine function of the input.
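A hedged sketch of a convolutional-recurrent model for audio: convolutions extract local spectro-temporal features and a GRU models their sequence. Note the paper's specific proposal puts recurrence inside the patch-level feature extraction itself; this illustrative code shows the more common conv-then-RNN arrangement.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)))
        self.gru = nn.GRU(32 * (n_mels // 2), 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, 1, mels, frames)
        h = self.conv(x)                      # (batch, 32, mels/2, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, frames, features)
        h, _ = self.gru(h)
        return self.head(h[:, -1])            # classify from the last state

print(CRNN()(torch.randn(2, 1, 64, 100)).shape)  # torch.Size([2, 10])
```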