DCASE 2016 is a dataset for sound event detection. It consists of 20 short mono sound files for each of 11 sound classes from office environments (such as clearthroat, drawer, or keyboard), each file containing one sound event instance. Sound files are annotated with event onset and offset times; however, silences between the actual physical sounds (as with a phone ringing) are not marked and are therefore “included” in the event.
30 PAPERS • NO BENCHMARKS YET
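A minimal sketch for working with annotations of this kind, assuming plain-text files with one event per line as tab-separated onset, offset, and label (verify against the actual release format):

```python
# Hedged sketch: parse DCASE-style event annotations and sum per-class
# durations. The "<onset>\t<offset>\t<label>" line layout is an assumption.
from collections import defaultdict

def load_events(path):
    """Return a list of (onset, offset, label) tuples from one annotation file."""
    events = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 3:
                continue
            events.append((float(parts[0]), float(parts[1]), parts[2]))
    return events

def total_duration_per_class(paths):
    """Sum annotated event durations per class across many files."""
    totals = defaultdict(float)
    for path in paths:
        for onset, offset, label in load_events(path):
            totals[label] += offset - onset
    return dict(totals)
```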
The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set of FSDnoisy18k consists of 15,813 audio clips (38.8h), and the test set consists of 947 audio clips (1.4h) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label is part of the target class set. Analogously, OOV means that the true or missing label is not covered by those 20 classes.
18 PAPERS • NO BENCHMARKS YET
The DESED dataset is a dataset designed to recognize sound event classes in domestic environments. The dataset is designed to be used for sound event detection (SED, recognize events with their time boundaries) but it can also be used for sound event tagging (SET, indicate presence of an event in an audio file). The dataset is composed of 10 event classes to recognize in 10 second audio files. The classes are: Alarm/bell/ringing, Blender, Cat, Dog, Dishes, Electric shaver/toothbrush, Frying, Running water, Speech, Vacuum cleaner.
13 PAPERS • 1 BENCHMARK
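To illustrate the SED/SET distinction for DESED, a hedged sketch that collapses strong (timed) annotations into clip-level tags, assuming a tab-separated metadata file with filename, onset, offset, and event_label columns (check the actual DESED metadata before relying on this):

```python
# Sketch: derive clip-level tags (SET) from strong annotations (SED).
import csv
from collections import defaultdict

def strong_to_weak(metadata_tsv):
    """Map each 10-second clip to the set of event classes it contains."""
    tags = defaultdict(set)
    with open(metadata_tsv, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            tags[row["filename"]].add(row["event_label"])
    return tags

# Example: clips tagged with "Speech" (presence only, no time boundaries).
# speech_clips = [f for f, labels in strong_to_weak("metadata.tsv").items()
#                 if "Speech" in labels]
```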
L3DAS22: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING This dataset supports the L3DAS22 IEEE ICASSP Gand Challenge. The challenge is supported by a Python API that facilitates the dataset download and preprocessing, the training and evaluation of the baseline models and the results submission.
12 PAPERS • NO BENCHMARKS YET
DCASE 2013 is a dataset for sound event detection. It consists of audio-only recordings where individual sound events are prominent in an acoustic scene.
11 PAPERS • NO BENCHMARKS YET
The TUT Sound Events 2017 dataset contains 24 audio recordings in a street environment and contains 6 different classes. These classes are: brakes squeaking, car, children, large vehicle, people speaking, and people walking.
8 PAPERS • NO BENCHMARKS YET
TUT-SED Synthetic 2016 consists of mixture signals artificially generated from isolated sound event samples. This approach yields more accurate onset and offset annotations than datasets based on recordings from real acoustic environments, where annotations are always somewhat subjective. Mixture signals in the dataset are created by randomly selecting and mixing together isolated sound events from 16 sound event classes, so the resulting mixtures contain sound events with varying polyphony. Altogether, 994 sound event samples were purchased from Sound Ideas. Of the 100 mixtures created, 60% were assigned to training, 20% to testing, and 20% to validation. The total amount of audio material in the dataset is 566 minutes. Different instances of the sound events are used to synthesize the training, validation, and test partitions. Mixtures were created by randomly selecting an event instance and, from it, a random segment of 3-15 seconds in length. Between events, silent regions of random length were introduced.
7 PAPERS • NO BENCHMARKS YET
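A rough illustration of the mixing recipe described above (random 3-15 s event segments summed into a buffer at random onsets, with onset/offset annotations recorded). This is not the original generation code; `events` and `sr` are hypothetical inputs:

```python
# Illustrative sketch: build a polyphonic mixture from isolated events.
# `events` is assumed to be a list of (label, mono_waveform) pairs at rate `sr`.
import numpy as np

def synthesize_mixture(events, sr, mixture_len_s=60.0, n_events=8, rng=None):
    rng = rng or np.random.default_rng()
    mixture = np.zeros(int(mixture_len_s * sr))
    annotations = []  # (onset_s, offset_s, label)
    for _ in range(n_events):
        label, wav = events[rng.integers(len(events))]
        seg_len = int(rng.uniform(3.0, 15.0) * sr)          # 3-15 s segment
        start = rng.integers(max(1, len(wav) - seg_len + 1))
        seg = wav[start:start + seg_len]
        onset = rng.uniform(0.0, mixture_len_s - len(seg) / sr)
        i = int(onset * sr)
        mixture[i:i + len(seg)] += seg                       # overlaps -> polyphony
        annotations.append((onset, onset + len(seg) / sr, label))
    return mixture, annotations
```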
L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65 hours 3D audio corpus, accompanied with a Python API that facilitates the data usage and results submission stage.
5 PAPERS • 2 BENCHMARKS
The TAU-NIGENS Spatial Sound Events 2021 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array one (MIC) and a first-order Ambisonics one (FOA). The sound events are spatialized either as stationary sound sources in the room or as moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a single direction-of-arrival (DoA) if static, a trajectory of DoAs if moving, and a temporal onset and offset time. The isolated sound event recordings used for the synthesis of the sound scenes are sourced from the NIGENS general sound events database.
4 PAPERS • 1 BENCHMARK
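A hedged sketch for reading per-frame DoA metadata of this kind, assuming the CSV layout used in the DCASE SELD challenges (frame index, class index, track index, azimuth and elevation in degrees); verify against the released metadata files:

```python
# Sketch: load SELD-style metadata and convert each DoA to a unit vector.
import csv
import math

def load_doa_labels(csv_path):
    """Return a list of (frame, class_idx, track_idx, unit_xyz) entries."""
    entries = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            frame, cls, track = int(row[0]), int(row[1]), int(row[2])
            az, el = math.radians(float(row[3])), math.radians(float(row[4]))
            xyz = (math.cos(el) * math.cos(az),   # DoA on the unit sphere
                   math.cos(el) * math.sin(az),
                   math.sin(el))
            entries.append((frame, cls, track, xyz))
    return entries
```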
The BirdVox-full-night dataset contains 6 audio recordings, each about ten hours in duration. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10. Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency. He found 35,402 flight calls in total and estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in these recordings. Species are not labeled in BirdVox-full-night, but it is possible to tell thrushes apart from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.
2 PAPERS • NO BENCHMARKS YET
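A hedged sketch of the frequency-based grouping mentioned above; the column names and the 5 kHz threshold are illustrative placeholders, not values taken from the dataset documentation:

```python
# Sketch: split annotated flight calls into two coarse groups by center
# frequency (e.g., thrush-like vs. warbler/sparrow-like). Column names and
# the threshold are assumptions for illustration only.
import csv

def split_calls_by_frequency(csv_path, threshold_hz=5000.0,
                             time_col="time_s", freq_col="center_freq_hz"):
    low, high = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            t, freq = float(row[time_col]), float(row[freq_col])
            (low if freq < threshold_hz else high).append((t, freq))
    return low, high
```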
The DCASE 2017 rare sound events dataset contains isolated sound events for three classes: 148 crying babies (mean duration 2.25 s), 139 glasses breaking (mean duration 1.16 s), and 187 gunshots (mean duration 1.32 s). As with the DCASE 2016 data, silences are not excluded from active event markings in the annotations. While this dataset contains many samples per class, there are only three classes.
This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but they also have wider applicability to sound event detection, classification, and localization. The taxonomy used for the labels in this dataset was designed to be compatible with other existing datasets for urban sound tagging while also capturing sound events unique to the Singaporean context. Please refer to our conference paper published in APSIPA 2021 (found in this repository as the file "APSIPA.pdf") or the readme ("Readme.md") for more details regarding the data collection, annotation, and processing methodologies used to create the dataset.
The TAU-NIGENS Spatial Sound Events 2020 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array one (MIC) and a first-order Ambisonics one (FOA). The sound events are spatialized either as stationary sound sources in the room or as moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) relative to the recording point, and a temporal onset and offset time. The isolated sound event recordings used for the synthesis of the sound scenes are sourced from the NIGENS general sound events database.
URBAN-SED is a dataset of 10,000 soundscapes with sound event annotations generated using the scaper library. The dataset totals almost 30 hours and includes close to 50,000 annotated sound events. Every soundscape is 10 seconds long and has a background of Brownian noise resembling the typical “hum” often heard in urban environments. Every soundscape contains between 1 and 9 sound events from the following classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren and street_music. The source material for the sound events are the clips from the UrbanSound8K dataset. URBAN-SED comes pre-sorted into three sets: train, validate and test. There are 6,000 soundscapes in the training set, generated using clips from folds 1-6 in UrbanSound8K, 2,000 soundscapes in the validation set, generated using clips from folds 7-8 in UrbanSound8K, and 2,000 soundscapes in the test set, generated using clips from folds 9-10 in UrbanSound8K.
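A minimal scaper sketch in the spirit of URBAN-SED's generation, assuming foreground/background folders organized with one subfolder per label as scaper expects; all parameter values are illustrative, not the settings used to build the dataset:

```python
# Sketch: generate one 10-second soundscape with scaper. Folder layout and
# parameter choices are assumptions for illustration only.
import scaper

sc = scaper.Scaper(duration=10.0, fg_path="foreground", bg_path="background")
sc.ref_db = -50

# Background "hum" drawn from whatever noise clips are available
sc.add_background(label=("const", "noise"),
                  source_file=("choose", []),
                  source_time=("const", 0))

# A handful of foreground events at random times and SNRs
for _ in range(5):
    sc.add_event(label=("choose", []),
                 source_file=("choose", []),
                 source_time=("const", 0),
                 event_time=("uniform", 0, 9),
                 event_duration=("uniform", 1, 4),
                 snr=("uniform", 6, 30),
                 pitch_shift=None,
                 time_stretch=None)

sc.generate(audio_path="soundscape.wav", jams_path="soundscape.jams")
```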
VOICe is a dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures of three different sound events ("baby crying", "glass breaking", and "gunshot") superimposed on three different categories of acoustic scenes: vehicle, outdoors, and indoors. The mixtures are also offered without any background noise.
This is an audio dataset containing about 1,500 audio clips recorded by multiple professional players.
1 PAPER • NO BENCHMARKS YET
USM-SED is a dataset for polyphonic sound event detection in urban sound monitoring use-cases. Based on isolated sounds taken from the FSD50k dataset, 20,000 polyphonic soundscapes are synthesized with sounds being randomly positioned in the stereo panorama using different loudness levels.
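A hedged sketch of the synthesis idea described for USM-SED (random stereo positions and loudness levels), using constant-power panning; this is an illustration, not the dataset's actual pipeline, and `sources` and `sr` are hypothetical inputs:

```python
# Sketch: mix mono sources into a stereo scene with random pan and gain.
import numpy as np

def mix_stereo_scene(sources, sr, scene_len_s=30.0, rng=None):
    rng = rng or np.random.default_rng()
    scene = np.zeros((2, int(scene_len_s * sr)))
    for wav in sources:
        gain_db = rng.uniform(-24.0, 0.0)            # random loudness level
        pan = rng.uniform(0.0, 1.0)                  # 0 = left, 1 = right
        left, right = np.cos(pan * np.pi / 2), np.sin(pan * np.pi / 2)
        gain = 10.0 ** (gain_db / 20.0)
        onset = rng.integers(0, max(1, scene.shape[1] - len(wav)))
        seg = wav[: scene.shape[1] - onset]          # clip if it overruns
        scene[0, onset:onset + len(seg)] += gain * left * seg
        scene[1, onset:onset + len(seg)] += gain * right * seg
    return scene
```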
The TUT Sound Events 2018 dataset consists of real-life first-order Ambisonics (FOA) format recordings with stationary point sources, each associated with a spatial coordinate. The dataset was generated by collecting impulse responses (IRs) from a real environment using the Eigenmike spherical microphone array. The measurement was done by slowly moving a Genelec G Two loudspeaker, continuously playing a maximum length sequence, around the array in a circular trajectory at one elevation at a time. The playback volume was set 30 dB above the ambient sound level. The recording was done during work hours in a corridor inside the university with classrooms around it. The IRs were collected at elevations from −40° to 40° in 10-degree increments at 1 m from the Eigenmike, and at elevations from −20° to 20° in 10-degree increments at 2 m.
0 PAPERS • NO BENCHMARKS YET