What is Coqui 🐸
Coqui is an advanced Text-to-Speech (TTS) generation library based on the latest research, which grew out of Mozilla's TTS project. It is designed to achieve the best balance between ease of training, speed and quality. Coqui includes pre-trained models and tools to measure the quality of datasets, and it has been used in more than 20 languages in products and research projects.
It is open source and free.
Coqui is also an online platform that sells voice synthesis as a service.
Coqui features
The main features of Coqui are:
- High-performance Deep Learning models for Text2Speech tasks.
- Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
- Speaker encoder to compute speaker embeddings efficiently.
- Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN).
- Fast and efficient model training.
- Detailed training logs in the terminal and in TensorBoard.
- Multi-language and multi-speaker TTS support.
- Efficient, flexible and lightweight yet full-featured Trainer API.
- Released, ready-to-use models.
- Tools for curating Text2Speech datasets.
- Utilities to use and test models.
- Modular (but not too modular) code base that allows easy implementation of new ideas.
Models implemented in Coqui:
- Spectrogram models (Tacotron, Tacotron2, Glow-TTS, Speedy-Speech, Align-TTS, FastPitch, FastSpeech, FastSpeech2, SC-GlowTTS, Capacitron, OverFlow, Neural HMM TTS).
- End-to-end models (VITS, YourTTS).
- Attention methods (guided attention, forward and backward decoding, Graves attention, dual decoder consistency, dynamic convolutional attention, alignment network).
- Speaker encoder (GE2E, Angular Loss).
- Vocoders (MelGAN, MultiBandMelGAN, ParallelWaveGAN, GAN-TTS discriminators, WaveRNN, WaveGrad, HiFiGAN, UnivNet).
- Voice conversion (FreeVC).
Coqui can be installed using pip or by cloning the GitHub repository and running an install command. It is also possible to use a Docker image to test Coqui without installing it. The library provides a Python API and a command line interface for synthesizing speech with pre-trained and custom models.
In summary, Coqui is a complete and advanced library for text-to-speech generation, offering a wide range of models and features to facilitate the creation and use of high quality TTS systems.
How to install Coqui 🍣
Here I will guide you through installing Coqui on a Linux operating system, specifically Fedora; the steps are the same for CentOS. If you use a .deb-based distribution such as Ubuntu or Debian, just replace "dnf install" with "apt-get install" (a few package names differ slightly, e.g. "python3-dev" instead of "python3-devel"). The commands need root privileges, so either run them with "sudo" as shown or switch to root first with the "su" command.
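Since the only differences between the two families of distributions are the package manager and a few package names, the choice can be automated. The following is a small Python sketch, not part of Coqui, that builds the install command for whichever of "dnf" or "apt-get" it finds. The Debian/Ubuntu names "python3-dev" and "libsndfile1" are my assumed equivalents of Fedora's "python3-devel" and "libsndfile", so check them against your distribution; the function and constant names are illustrative.

```python
import shutil
from typing import List, Optional

# Packages used in this guide, as named on Fedora/CentOS (dnf).
FEDORA_PACKAGES = ["python3", "python3-devel", "espeak-ng", "libsndfile"]

# Assumed Debian/Ubuntu names for the packages whose names differ.
DEBIAN_NAMES = {"python3-devel": "python3-dev", "libsndfile": "libsndfile1"}


def install_command(manager: Optional[str] = None) -> List[str]:
    """Return the install command for 'dnf' or 'apt-get'.

    If manager is None, detect it by looking for the binaries on PATH.
    """
    if manager is None:
        manager = next((m for m in ("dnf", "apt-get") if shutil.which(m)), None)
    if manager == "dnf":
        return ["dnf", "install", *FEDORA_PACKAGES]
    if manager == "apt-get":
        # Swap in the Debian name where one is known, keep the rest as-is.
        packages = [DEBIAN_NAMES.get(p, p) for p in FEDORA_PACKAGES]
        return ["apt-get", "install", *packages]
    raise RuntimeError("Neither dnf nor apt-get was found")


if __name__ == "__main__":
    # Print the command so you can run it as root (for example after "su").
    try:
        print(" ".join(install_command()))
    except RuntimeError as err:
        print(err)
```

Printing the command instead of running it keeps the script harmless: you still decide when to execute it with root privileges.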
Step 1: Install or upgrade Python
Check which version you have installed:
python3 --version
If it is lower than version 3.7, upgrade it. Then install the Python development headers, which are needed to compile parts of the package during installation:
sudo dnf install python3-devel
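The same version check can be done from Python itself. This is a minimal sketch that compares the running interpreter against the 3.7 minimum quoted above; the function name "python_is_supported" is just illustrative.

```python
import sys
from typing import Tuple

# Minimum version mentioned in this guide.
MIN_VERSION = (3, 7)


def python_is_supported(version: Tuple[int, ...] = tuple(sys.version_info[:2]),
                        minimum: Tuple[int, int] = MIN_VERSION) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    # Tuples compare element by element, so (3, 10) >= (3, 7) is True,
    # whereas comparing the strings "3.10" and "3.7" would give the wrong answer.
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    print(current, "is fine" if python_is_supported() else "is too old, please upgrade")
```

Comparing version tuples rather than version strings is the important design choice here: string comparison would wrongly rank 3.10 below 3.7.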
Step 2: Install Coqui dependencies
sudo dnf install espeak-ng libsndfile
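Before moving on, you may want to confirm that both dependencies are actually available. The sketch below assumes that espeak-ng installs an "espeak-ng" executable on your PATH and that libsndfile can be located under the short library name "sndfile"; the helper name is hypothetical.

```python
import ctypes.util
import shutil
from typing import Dict


def check_system_dependencies() -> Dict[str, bool]:
    """Report whether the espeak-ng program and the libsndfile library are found."""
    return {
        # espeak-ng ships a command-line program with the same name.
        "espeak-ng": shutil.which("espeak-ng") is not None,
        # find_library resolves the short name "sndfile" to libsndfile.so.* if present.
        "libsndfile": ctypes.util.find_library("sndfile") is not None,
    }


if __name__ == "__main__":
    for name, found in check_system_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```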
Step 3: Clone the Coqui repository
git clone https://github.com/coqui-ai/TTS
Then change into the directory where you cloned it:
cd TTS
Step 4: Install Coqui TTS
sudo pip install -e .[all,dev,notebooks]
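To check that the installation worked, you can ask Python whether the "TTS" package is importable and which version pip registered for it. This is a minimal sketch: it relies on "importlib.metadata" (available from Python 3.8) and assumes the distribution name "TTS", which is the package name used by the repository; the helper name is illustrative.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_tts_version() -> Optional[str]:
    """Return the installed version of the TTS package, or None if it is missing."""
    # find_spec checks whether "import TTS" would succeed without importing it.
    if importlib.util.find_spec("TTS") is None:
        return None
    try:
        # Version recorded by pip for the "TTS" distribution.
        return metadata.version("TTS")
    except metadata.PackageNotFoundError:
        # Importable (e.g. when run from inside the cloned folder) but not pip-installed.
        return None


if __name__ == "__main__":
    version = installed_tts_version()
    print(f"Coqui TTS {version} is installed" if version else "Coqui TTS was not found")
```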
How to synthesize voice with Coqui 📢
You can actually do this graphically from the browser, but here we will do it from the terminal, in two ways. The simplest is to open a terminal and run this command:
tts --text "Text you want to synthesize" --out_path "path/to/output.wav"
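Just replace the text in quotation marks with the text you want to synthesize, and the output path with one of your own; the command will create an audio file with that text. Without further options, "tts" uses its default English model; you can list the available pre-trained models with "tts --list_models" and pick one with "--model_name". If you want to choose and configure the model files yourself, use the Python API with code like the following: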
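from TTS.utils.synthesizer import Synthesizer

# Model configuration (placeholder paths: point them to the checkpoint and config files you downloaded)
tts_model = "tts_models/es/forward_tacotron/tts.pt"
tts_config = "tts_models/es/forward_tacotron/config.json"
vocoder_model = "vocoder_models/universal/libri-tts/wavegrad.pt"
vocoder_config = "vocoder_models/universal/libri-tts/config.json"

# Create a synthesizer instance (keyword arguments, because the third and fourth
# positional parameters are the speakers and languages files, not the vocoder)
synthesizer = Synthesizer(
    tts_checkpoint=tts_model,
    tts_config_path=tts_config,
    vocoder_checkpoint=vocoder_model,
    vocoder_config=vocoder_config,
)

# Synthesize the text (tts() returns the audio samples) and save them as a WAV file
texto = "Text you want to synthesize"
ruta_salida = "path/to/output.wav"
wav = synthesizer.tts(texto)
synthesizer.save_wav(wav, ruta_salida)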
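"synthesizer.tts()" returns the audio as a list of floating-point samples, and writing those samples to disk is a separate step. To illustrate what that step involves, here is a standard-library sketch that peak-normalizes float samples to 16-bit PCM and writes a mono WAV file. It is not Coqui's own implementation; the function name "write_wav" and the 22050 Hz default are assumptions. With a real synthesizer you would pass the model's output sample rate instead, for example the "output_sample_rate" attribute that recent versions of "Synthesizer" expose.

```python
import struct
import wave
from typing import Sequence


def write_wav(samples: Sequence[float], path: str, sample_rate: int = 22050) -> None:
    """Write mono float samples to a 16-bit PCM WAV file.

    Samples are scaled so the loudest one maps to full scale (32767); the 0.01
    floor keeps near-silent audio from being amplified excessively.
    """
    peak = max([abs(s) for s in samples] + [0.01])
    scale = 32767 / peak
    # "<" = little-endian, as required by WAV; "h" = signed 16-bit integer.
    frames = struct.pack(f"<{len(samples)}h", *(int(round(s * scale)) for s in samples))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
```

Under those assumptions, a call such as write_wav(synthesizer.tts(texto), ruta_salida, synthesizer.output_sample_rate) would produce a playable file.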