Speech Synthesis
290 papers with code • 4 benchmarks • 19 datasets
Speech synthesis is the task of generating speech from some other modality like text, lip movements etc.
Please note that the leaderboards here are not really comparable between studies - as they use mean opinion score as a metric and collect different samples from Amazon Mechnical Turk.
( Image credit: WaveNet: A generative model for raw audio )
Libraries
Use these libraries to find Speech Synthesis models and implementationsDatasets
Subtasks
- Expressive Speech Synthesis
- Emotional Speech Synthesis
- text-to-speech translation
- Speech Synthesis - Tamil
- Speech Synthesis - Tamil
- Speech Synthesis - Kannada
- Speech Synthesis - Malayalam
- Speech Synthesis - Telugu
- Speech Synthesis - Assamese
- Speech Synthesis - Bengali
- Speech Synthesis - Bodo
- Speech Synthesis - Gujarati
- Speech Synthesis - Hindi
- Speech Synthesis - Manipuri
- Speech Synthesis - Marathi
- Speech Synthesis - Rajasthani
Most implemented papers
WaveNet: A Generative Model for Raw Audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e. g., pitch, energy and more accurate duration) as conditional inputs.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
FastSpeech: Fast, Robust and Controllable Text to Speech
In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques.
Efficient Neural Audio Synthesis
The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time.
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Clone a voice in 5 seconds to generate arbitrary speech in real-time