Matcha-TTS is a non-autoregressive neural text-to-speech architecture that uses conditional flow matching to synthesize natural-sounding speech quickly. It models speech generation as the solution of an ODE, and training with conditional flow matching lets it reach high-quality audio in fewer synthesis steps than diffusion models trained with score matching, which reduces latency. The model is fully probabilistic, so it can generate diverse realizations of the same text while remaining stable and intelligible. The repository provides an end-to-end TTS pipeline: a PyTorch/Lightning training stack, configuration files, pre-trained checkpoints, a command-line interface, and a Gradio app for interactive testing. Users can train on standard datasets such as LJSpeech or plug in their own corpora, with helper tools for computing dataset statistics, extracting phoneme durations, and running multi-GPU training.
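To illustrate why few synthesis steps can suffice, here is a minimal sketch of fixed-step Euler integration of a flow-matching ODE from noise (t = 0) to data (t = 1). It is not the repository's code: `euler_sample`, `make_toy_field`, and `SIGMA_MIN` are hypothetical names, and the toy vector field is an assumption that stands in for the trained, text-conditioned network. It uses the closed-form conditional field of a straight-line probability path toward a known target.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small noise floor at t = 1 (illustrative value)


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


def make_toy_field(target):
    """Closed-form conditional vector field of a straight-line path toward `target`.

    A real model would replace this with a neural network conditioned on text.
    """
    def field(x, t):
        return (target - (1.0 - SIGMA_MIN) * x) / (1.0 - (1.0 - SIGMA_MIN) * t)
    return field


rng = np.random.default_rng(0)
target = rng.standard_normal((80, 50))  # stand-in for an 80-bin, 50-frame mel-spectrogram
noise = rng.standard_normal((80, 50))   # starting point drawn from a standard normal

sample = euler_sample(make_toy_field(target), noise, n_steps=4)
print(np.abs(sample - target).max())    # small: the trajectory ends next to the target
```

Because the toy trajectories are straight lines, Euler integration follows them exactly and the step count barely matters. A learned vector field is only approximately straight, so in practice the number of steps trades speed against quality.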
Features
- Non-autoregressive TTS architecture based on conditional flow matching for fast synthesis
- Probabilistic speech generation with natural-sounding, high-quality audio outputs
- Ready-to-use CLI and Gradio app for text-to-speech from the terminal or browser
- Full training pipeline with Hydra configs, Lightning runner, and multi-GPU support
- ONNX export and ONNX Runtime inference, with the option to embed the vocoder in the exported model for end-to-end synthesis
- Utilities for dataset normalization, phoneme alignment extraction, and custom-dataset training
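
The dataset-normalization step mentioned above can be sketched as follows. This is an illustration, not the repository's utility: it assumes normalization uses one global mean and standard deviation computed over all mel-spectrogram values in the training set, and `mel_statistics` and `normalize` are hypothetical names.

```python
import numpy as np


def mel_statistics(mels):
    """Return the global mean and std over all values of variable-length mel arrays.

    Sums are accumulated per utterance so the whole dataset never has to be
    concatenated in memory.
    """
    total = 0.0
    total_sq = 0.0
    count = 0
    for mel in mels:
        mel = np.asarray(mel, dtype=np.float64)
        total += mel.sum()
        total_sq += np.square(mel).sum()
        count += mel.size
    mean = total / count
    # Clamp at zero to guard against tiny negative values from rounding.
    std = np.sqrt(max(total_sq / count - mean ** 2, 0.0))
    return mean, std


def normalize(mel, mean, std):
    """Shift and scale a mel-spectrogram with the dataset statistics."""
    return (np.asarray(mel, dtype=np.float64) - mean) / std


rng = np.random.default_rng(0)
# Two synthetic "utterances" of different lengths (80 mel bins each).
mels = [rng.normal(-5.0, 2.0, (80, 30)), rng.normal(-5.0, 2.0, (80, 45))]

mean, std = mel_statistics(mels)
normalized = [normalize(m, mean, std) for m in mels]
```

For a custom corpus, statistics like these would be computed once on the training split and then stored in the dataset configuration so that training and inference apply the same scaling.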