This is a public domain speech dataset (LJ Speech) consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books, with a transcription provided for each clip. Clips vary in length from 1 to 10 seconds and total approximately 24 hours. The texts were published between 1884 and 1964 and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
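A minimal sketch of iterating over the clips, assuming the standard LJSpeech-1.1 archive layout (a metadata.csv with pipe-separated fields and a wavs/ directory of 22.05 kHz mono WAV files):

```python
import csv
from pathlib import Path

root = Path("LJSpeech-1.1")  # path to the extracted archive

with open(root / "metadata.csv", encoding="utf-8") as f:
    # Fields per row: clip id | raw transcription | normalized transcription
    for clip_id, text, normalized in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        wav_path = root / "wavs" / f"{clip_id}.wav"
        print(wav_path, normalized)
```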
269 PAPERS • 3 BENCHMARKS
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (MP3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus. The main differences from the LibriSpeech corpus are listed below:
1. The audio files are at a 24 kHz sampling rate instead of 16 kHz.
2. The speech is split at sentence breaks rather than at silences.
3. Both original and normalized texts are included.
4. Contextual information (e.g., neighbouring sentences) can be extracted.
5. Utterances with significant background noise are excluded.
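A hedged loading sketch using torchaudio's built-in dataset class (torchaudio.datasets.LIBRITTS); subset names such as "train-clean-100" follow the LibriSpeech convention:

```python
import torchaudio

dataset = torchaudio.datasets.LIBRITTS("data", url="train-clean-100", download=True)

# Each item pairs 24 kHz audio with both the original and the normalized text.
(waveform, sample_rate, original_text, normalized_text,
 speaker_id, chapter_id, utterance_id) = dataset[0]
print(sample_rate, speaker_id, normalized_text)
```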
187 PAPERS • 1 BENCHMARK
THCHS-30 is a free Chinese speech database that can be used to build a full-fledged Chinese speech recognition system.
30 PAPERS • NO BENCHMARKS YET
A collection of single-speaker speech datasets for ten languages, composed of short audio clips from LibriVox audiobooks and their aligned texts.
21 PAPERS • NO BENCHMARKS YET
A large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels.
7 PAPERS • NO BENCHMARKS YET
JSUT Corpus is a free large-scale Japanese speech corpus for end-to-end speech synthesis research. It was created to fill the gap left by the absence of a large-scale Japanese corpus that can be shared between academic institutions and commercial companies.
PromptSpeech is a dataset consisting of speech and the corresponding style prompts. The speech is synthesized with 5 different style factors (gender, pitch, speaking speed, volume, and emotion) from a commercial TTS API; the emotion factor has 5 categories and the gender factor has 2.
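For illustration, a hypothetical record schema covering the five style factors named above; the actual field names and category values in the released dataset may differ:

```python
from dataclasses import dataclass

@dataclass
class PromptSpeechItem:
    audio_path: str       # synthesized speech clip
    prompt: str           # natural-language style description
    transcript: str       # spoken content
    gender: str           # 2 categories, e.g. "male" / "female"
    pitch: str            # e.g. "low" / "normal" / "high"
    speaking_speed: str
    volume: str
    emotion: str          # 5 categories

item = PromptSpeechItem(
    "clips/0001.wav",
    "A woman speaks slowly and quietly with a sad tone.",
    "The weather is nice today.",
    gender="female", pitch="low", speaking_speed="slow",
    volume="quiet", emotion="sad",
)
```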
The SOMOS dataset is a large-scale mean opinion score (MOS) dataset consisting solely of neural text-to-speech (TTS) samples. It can be employed to train automatic MOS prediction systems focused on the assessment of modern synthesizers, and can stimulate advancements in acoustic model evaluation. It consists of 20K synthetic utterances in the LJ Speech voice, a public domain speech dataset that is a common benchmark for building neural acoustic models and vocoders. Utterances are generated by 200 TTS systems, including vanilla neural acoustic models as well as models that allow prosodic variations.
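A sketch of the basic aggregation such a dataset supports: averaging listener ratings per utterance and per TTS system. The column names here are assumptions, not the dataset's actual header:

```python
import pandas as pd

ratings = pd.DataFrame({
    "system": ["A", "A", "A", "B", "B", "B"],
    "utterance": ["u1", "u1", "u2", "u1", "u1", "u2"],
    "score": [4, 5, 3, 2, 3, 4],  # 1-5 opinion scores
})

# MOS per utterance, and per system averaged over its utterances
utterance_mos = ratings.groupby(["system", "utterance"])["score"].mean()
system_mos = ratings.groupby("system")["score"].mean()
print(system_mos)
```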
6 PAPERS • NO BENCHMARKS YET
The dataset contains several speakers; the 5 largest are listed individually and the rest are summarized as "other". All audio files have a sampling rate of 44.1 kHz. For each speaker there is, in addition to the full data, a clean variant of even higher quality. Various statistics are also provided. The dataset can also be used for automatic speech recognition (ASR) if the audio files are converted to 16 kHz.
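A minimal sketch of the 44.1 kHz to 16 kHz conversion mentioned above, using torchaudio (librosa.resample would work equally well); the file path is hypothetical:

```python
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("speaker1/clip_0001.wav")  # hypothetical path
assert sr == 44100
waveform_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("clip_0001_16k.wav", waveform_16k, 16000)
```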
3 PAPERS • 2 BENCHMARKS
The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of the lips. It contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.
3 PAPERS • NO BENCHMARKS YET
The Jejueo Interview Transcripts (JIT) dataset is a parallel corpus containing 170k+ Jejueo-Korean sentences.
2 PAPERS • NO BENCHMARKS YET
The dataset contains 10.5 hours of speech from a single speaker.
Tilde MODEL Corpus is a set of multilingual corpora for European languages, with a particular focus on the smaller languages. The collected resources have been cleaned, aligned, and formatted into the standard TMX format, making them usable for developing new language technology products and services.
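A hedged sketch of reading aligned segments from a TMX file with the standard library: TMX is an XML format in which each <tu> (translation unit) holds one <tuv> per language, with the text in a <seg> element. The filename is hypothetical:

```python
import xml.etree.ElementTree as ET

# xml:lang attributes are namespace-qualified in ElementTree
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

tree = ET.parse("tilde_model.tmx")  # hypothetical filename
for tu in tree.getroot().iter("tu"):
    # Map language code -> segment text for this translation unit
    pair = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
    print(pair)
```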
VocBench is a framework for benchmarking the performance of state-of-the-art neural vocoders. It uses a systematic study to evaluate different neural vocoders in a shared environment, enabling a fair comparison between them.
A database containing high-sampling-rate recordings of a single speaker reading sentences in Brazilian Portuguese in a neutral voice, along with the corresponding text corpus. Intended for speech synthesis and automatic speech recognition applications, the dataset contains text extracted from a popular Brazilian news TV program, totalling roughly 20 h of audio spoken by a trained individual in a controlled environment. The text was normalized during the recording process, and special textual occurrences (e.g. acronyms, numbers, foreign names) were replaced by readable phonetic renderings in Portuguese. There are no noticeable accidental sounds, and background noise has been kept to a minimum in all audio samples.
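An illustrative sketch of the kind of number normalization described above, expanding digits into readable Portuguese with the num2words package; the dataset's own normalization pipeline is not published here:

```python
import re
from num2words import num2words

def expand_numbers_pt(text: str) -> str:
    # Replace every run of digits with its written-out Portuguese form
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="pt_BR"), text)

print(expand_numbers_pt("O programa teve 20 h de áudio"))
# -> "O programa teve vinte h de áudio"
```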
1 PAPER • NO BENCHMARKS YET
The Jejueo Single Speaker Speech (JSS) dataset consists of 10k high-quality audio files recorded by a native Jejueo speaker and a transcript file.
JVS-MuSiC is a Japanese multispeaker singing-voice corpus created to analyze and synthesize a variety of voices. The corpus consists of 100 singers' recordings of the same song, "Katatsumuri", a Japanese children's song. It also includes one additional song that differs for each singer.
RUSLAN is a Russian spoken-language corpus for text-to-speech tasks. It contains 22,200 audio samples with text annotations (more than 31 hours of high-quality speech from one person), making it one of the largest annotated Russian corpora in terms of single-speaker speech duration.
Facial electromyography recordings during both silent and vocalized speech.