The model requires at least 4 GB of GPU memory to generate a 30-second audio clip and produces roughly 7 semantic tokens per second on an NVIDIA 4090 GPU. Like other autoregressive models, it can exhibit stability issues, but generating multiple samples and keeping the best one is a practical workaround. The current token-level control units are [laugh], [uv_break], and [lbreak]; future versions may add further emotional control capabilities.
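Since drawing an extra sample is cheap relative to a failed take, the retry pattern is straightforward. Below is a minimal sketch, assuming the Python API shown in the ChatTTS README (`ChatTTS.Chat`, `load_models()`, and `infer()`; newer releases rename loading to `chat.load()`). The filenames and the choice of three candidates are illustrative.

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load_models()  # loads the pretrained weights (chat.load() in newer releases)

texts = ["ChatTTS is a text to speech model designed for dialogue applications."]

# Autoregressive decoding can occasionally be unstable, so draw a few
# candidates and keep whichever sounds best on review.
candidates = [chat.infer(texts)[0] for _ in range(3)]

# Save each candidate for listening; the model outputs 24 kHz audio.
for i, wav in enumerate(candidates):
    torchaudio.save(f"candidate_{i}.wav", torch.from_numpy(wav), 24000)
```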
Key takeaways:
- ChatTTS is a text-to-speech model optimized for dialogue scenarios, supporting English and Chinese, and trained on over 100,000 hours of data.
- The model allows for fine-grained control, enabling the prediction and control of prosodic features like laughter, pauses, and interjections.
- ChatTTS surpasses most open-source TTS models in prosody, producing notably natural, lifelike speech.
- Current token-level control units are [laugh], [uv_break], and [lbreak], with potential for additional emotional control capabilities in future versions (see the sketch below).
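Token-level control is applied by embedding the control units directly in the input text. A minimal sketch, reusing the `chat` instance from the example above and the bracketed-token convention from the ChatTTS README:

```python
# Control units are written inline: [uv_break] inserts an unvoiced pause,
# [laugh] inserts laughter, and [lbreak] marks an utterance break.
text = "What is [uv_break] your favorite english food? [laugh] [lbreak]"

wav = chat.infer([text])[0]
torchaudio.save("controlled.wav", torch.from_numpy(wav), 24000)
```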