The BASE TTS models are streamable: rather than generating a whole sentence at once, they can produce speech moment by moment at a relatively low bitrate. The team also attempted to package speech metadata, such as emotionality and prosody, in a separate low-bandwidth stream. However, the model remains experimental and is not a commercial product. The researchers have not published the model's source and other data due to the risk of misuse by bad actors.
Key takeaways:
- Researchers at Amazon have trained the largest-ever text-to-speech model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), which shows an improved ability to speak complex sentences naturally.
- The model was trained on 100,000 hours of public domain speech in multiple languages and has 980 million parameters, making it the largest in its category.
- The BASE TTS model has shown emergent abilities, handling inputs it wasn't specifically trained for, such as compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, and syntactic complexity.
- Despite its success, the team has declined to publish the model's source and other data due to the risk of misuse by bad actors.