AudioSet is an audio event dataset consisting of over 2M human-annotated 10-second clips collected from YouTube; as a result, many clips are of poor quality and contain multiple sound sources. The clips are annotated with a hierarchical ontology of 632 event classes, so the same sound can carry several labels: the sound of barking, for example, is annotated as Animal, Pets, and Dog (a sketch of walking the ontology follows below). All clips are split into Evaluation, Balanced-Train, and Unbalanced-Train sets.
586 PAPERS • 6 BENCHMARKS
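Because the ontology is hierarchical, a clip's leaf label implies all of its ancestors, as in the Dog example above. A minimal sketch of recovering that chain, assuming the ontology.json file distributed with the official AudioSet release (the id/name/child_ids fields match that file, but treat the exact schema as an assumption):

```python
import json

# Load the ontology shipped with AudioSet (a list of dicts with
# "id", "name", and "child_ids" fields).
with open("ontology.json") as f:
    nodes = json.load(f)

by_id = {n["id"]: n for n in nodes}
# Invert the child links; this simplification keeps one parent per node,
# although a few ontology nodes have several.
parent = {c: n["id"] for n in nodes for c in n.get("child_ids", [])}

def ancestors(label_name):
    """Return a label and its ancestors, e.g. Dog -> Pets -> Animal."""
    node_id = next(n["id"] for n in nodes if n["name"] == label_name)
    chain = [label_name]
    while node_id in parent:
        node_id = parent[node_id]
        chain.append(by_id[node_id]["name"])
    return chain

print(ancestors("Dog"))  # expected to include 'Pets' and 'Animal'
```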
WSJ0-2mix is a speech separation corpus of two-speaker mixtures created from utterances in the Wall Street Journal (WSJ0) corpus (a toy version of the mixing recipe is sketched below).
144 PAPERS • 2 BENCHMARKS
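The mixtures are built by summing pairs of single-speaker utterances at randomly drawn relative levels. A self-contained numpy sketch of that recipe; the 0-5 dB range follows the commonly cited setup, and the signals here are random stand-ins for WSJ0 audio:

```python
import numpy as np

def mix_pair(s1, s2, snr_db):
    """Scale s2 relative to s1 by snr_db and sum, truncating to the shorter utterance."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    # Match s2's power to s1's, then attenuate it by snr_db.
    gain = np.sqrt(np.sum(s1**2) / (np.sum(s2**2) + 1e-8)) * 10 ** (-snr_db / 20)
    s2 = s2 * gain
    return s1 + s2, s1, s2  # mixture plus the scaled sources as references

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)  # stand-in for a WSJ0 utterance
s2 = rng.standard_normal(16000)
mix, ref1, ref2 = mix_pair(s1, s2, snr_db=rng.uniform(0, 5))
```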
MUSDB18 is a dataset of 150 full-length music tracks (~10 h total duration) of different genres, together with their isolated drums, bass, vocals, and "other" stems (see the loading sketch below).
91 PAPERS • 2 BENCHMARKS
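A minimal sketch of iterating over the stems with the musdb Python package that accompanies the dataset; the attribute names below follow that package's documentation, but verify them against your installed version:

```python
import musdb

# Point at a local copy of MUSDB18; musdb.DB(download=True) instead fetches
# a 7-second preview version for quick tests.
mus = musdb.DB(root="path/to/MUSDB18", subsets="train")

for track in mus:
    mixture = track.audio                    # (n_samples, 2) stereo mixture
    vocals = track.targets["vocals"].audio   # isolated vocal stem
    drums = track.targets["drums"].audio     # isolated drum stem
    print(track.name, mixture.shape)
    break
```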
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique background noise scene (the noise-addition step is sketched below). Its extension, WHAMR!, adds artificial reverberation to the speech signals on top of the background noise.
79 PAPERS • 5 BENCHMARKS
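The construction amounts to adding a recorded noise scene to each wsj0-2mix mixture at a controlled signal-to-noise ratio. A hedged numpy sketch of that step; the SNR handling is illustrative, and the released WHAM! scripts define the exact levels:

```python
import numpy as np

def add_noise(mixture, noise, snr_db):
    """Add a background noise scene to a speech mixture at the given SNR."""
    n = min(len(mixture), len(noise))
    mixture, noise = mixture[:n], noise[:n]
    speech_pow = np.mean(mixture**2)
    noise_pow = np.mean(noise**2) + 1e-8
    # Scale the noise so that 10*log10(speech_pow / scaled_noise_pow) == snr_db.
    noise = noise * np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return mixture + noise

# Stand-ins for a wsj0-2mix mixture and a recorded ambient scene.
rng = np.random.default_rng(1)
noisy = add_noise(rng.standard_normal(32000), rng.standard_normal(32000), snr_db=-3.0)
```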
LibriMix is an open-source alternative to wsj0-2mix. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!.
78 PAPERS • 1 BENCHMARK
WHAMR! is a dataset for noisy and reverberant speech separation. It extends WHAM! by adding synthetic reverberation to the speech sources on top of the existing noise. Room impulse responses were generated and convolved using pyroomacoustics (see the sketch below). Reverberation times were chosen to approximate domestic and classroom environments (expected to be similar to the restaurants and coffee shops where the WHAM! noise was collected), and each mixture was further classified as having high, medium, or low reverberation based on a qualitative assessment of its noise recording.
45 PAPERS • 3 BENCHMARKS
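Since the reverberation is synthetic, the core of the pipeline can be reproduced in a few lines of pyroomacoustics. A minimal sketch targeting a given RT60; the room geometry, positions, and RT60 value are invented for illustration, and the calls below follow the pyroomacoustics API rather than the actual WHAMR! generation scripts:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [6.0, 5.0, 3.0]  # a domestic-sized room, chosen arbitrarily
rt60 = 0.4                  # "medium" reverberation, illustrative value

# Sabine's formula gives the wall absorption and image-source order
# needed to hit the target RT60.
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

dry = np.random.default_rng(2).standard_normal(fs)  # stand-in for a WSJ0 utterance
room.add_source([2.0, 2.5, 1.5], signal=dry)
room.add_microphone([3.5, 2.5, 1.2])

room.simulate()  # convolves the source with the simulated room impulse response
reverberant = room.mic_array.signals[0]
```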
The DNS Challenge at INTERSPEECH 2020 promoted collaborative research in single-channel speech enhancement aimed at maximizing the perceptual quality and intelligibility of the enhanced speech. Speech quality was evaluated with the online subjective evaluation framework ITU-T P.808, and the challenge provides large datasets for training noise suppressors.
38 PAPERS • 3 BENCHMARKS
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments vary in length between 3 and 10 seconds, and in each clip the only visible face in the video and the only audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4,700 hours of video segments from approximately 150,000 distinct speakers, spanning a wide variety of people, languages, and face poses.
35 PAPERS • NO BENCHMARKS YET
Spatialized Multi-Speaker Wall Street Journal (SMS-WSJ) consists of artificially mixed speech taken from the WSJ database; unlike earlier databases, it uses all WSJ0+1 utterances and strictly separates the speaker sets of the training, validation, and test sets (a split sketch follows below).
14 PAPERS • NO BENCHMARKS YET
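The key design point, speaker-disjoint splits, is easy to get wrong when building mixtures yourself. A small illustrative sketch of partitioning by speaker ID before any pairing; the IDs and ratios here are hypothetical:

```python
import random

def split_speakers(speaker_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition speakers (not utterances) so no speaker appears in two splits."""
    ids = sorted(speaker_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_valid = int(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]

train, valid, test = split_speakers([f"spk{i:03d}" for i in range(100)])
assert not (set(train) & set(test))  # strictly disjoint speaker sets
```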
The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references for experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.
13 PAPERS • NO BENCHMARKS YET
OpenMIC-2018 is an instrument recognition dataset containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive. Each example is a 10-second excerpt that has been partially labeled for the presence or absence of 20 instrument classes by annotators on a crowd-sourcing platform (a masked-loss sketch for such partial labels follows below).
7 PAPERS • 1 BENCHMARK
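Because each excerpt is only partially labeled, models are typically trained by masking out the unannotated classes rather than treating them as negatives. A numpy sketch of a masked binary cross-entropy under that assumption; the shapes and names are illustrative:

```python
import numpy as np

def masked_bce(probs, labels, mask, eps=1e-7):
    """Binary cross-entropy over the 20 instrument classes, counting only
    classes an annotator actually marked present/absent (mask == 1)."""
    probs = np.clip(probs, eps, 1 - eps)
    bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return (bce * mask).sum() / max(mask.sum(), 1)

# One toy batch of 4 excerpts x 20 classes; unlabeled entries get mask 0.
rng = np.random.default_rng(3)
probs = rng.uniform(size=(4, 20))
labels = rng.integers(0, 2, size=(4, 20)).astype(float)
mask = rng.integers(0, 2, size=(4, 20)).astype(float)
print(masked_bce(probs, labels, mask))
```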
MedleyVox is an evaluation dataset for separating multiple singing voices. The separation problem is categorised into i) duet, ii) unison, iii) main vs. rest, and iv) N-singing separation.
2 PAPERS • NO BENCHMARKS YET
Kinect-WSJ is a multichannel, multi-speaker, reverberated, noisy dataset that extends the single-channel, non-reverberated, noiseless WSJ0-2mix dataset to strong reverberation and noise conditions and to the Kinect-like microphone array geometry used in CHiME-5.
1 PAPER • NO BENCHMARKS YET
jaCappella is a corpus of Japanese a cappella vocal ensembles for vocal ensemble separation and synthesis. It consists of 35 copyright-cleared vocal ensemble songs and audio recordings of their individual voice parts. The songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion). They are divided into seven subsets, each featuring the typical characteristics of a music genre such as jazz or enka.
1 PAPER • 1 BENCHMARK