The model requires at least 4 GB of GPU memory to generate a 30-second audio clip and produces roughly 7 semantic tokens per second on an NVIDIA 4090 GPU. Like other autoregressive models, it can exhibit stability issues, but generating multiple samples and keeping the best one is a practical workaround. The current token-level control units are [laugh], [uv_break], and [lbreak]; future versions may add further emotional control capabilities.
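Since drawing an extra sample is cheap relative to a failed take, the retry pattern is straightforward. Below is a minimal sketch, assuming the Python API shown in the ChatTTS README (`ChatTTS.Chat`, `load_models()`, and `infer()`; newer releases rename loading to `chat.load()`). The filenames and the choice of three candidates are illustrative.

```python
import ChatTTS
import torch
import torchaudio

chat = ChatTTS.Chat()
chat.load_models()  # loads the pretrained weights (chat.load() in newer releases)

texts = ["ChatTTS is a text to speech model designed for dialogue applications."]

# Autoregressive decoding can occasionally be unstable, so draw a few
# candidates and keep whichever sounds best on review.
candidates = [chat.infer(texts)[0] for _ in range(3)]

# Save each candidate for listening; the model outputs 24 kHz audio.
for i, wav in enumerate(candidates):
    torchaudio.save(f"candidate_{i}.wav", torch.from_numpy(wav), 24000)
```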
Key takeaways:
- ChatTTS is a text-to-speech model optimized for dialogue scenarios, supporting English and Chinese, and trained on over 100,000 hours of data.
- The model allows for fine-grained control, enabling the prediction and control of prosodic features like laughter, pauses, and interjections.
- ChatTTS surpasses most open-source TTS models in prosody, producing notably natural, lifelike speech.
- Current token-level control units are [laugh], [uv_break], and [lbreak], with potential for additional emotional control capabilities in future versions (see the sketch below).
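Token-level control is applied by embedding the control units directly in the input text. A minimal sketch, reusing the `chat` instance from the example above and the bracketed-token convention from the ChatTTS README:

```python
# Control units are written inline: [uv_break] inserts an unvoiced pause,
# [laugh] inserts laughter, and [lbreak] marks an utterance break.
text = "What is [uv_break] your favorite english food? [laugh] [lbreak]"

wav = chat.infer([text])[0]
torchaudio.save("controlled.wav", torch.from_numpy(wav), 24000)
```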