Largest text-to-speech AI model yet shows 'emergent abilities'

Researchers at Amazon have developed the largest text-to-speech model to date, known as Big Adaptive Streamable TTS with Emergent abilities (BASE TTS). The model, trained on 100,000 hours of public domain speech, exhibits emergent qualities that improve its ability to speak complex sentences naturally. The team found that as large language models (LLMs) grow in size, they become more robust and versatile, performing tasks they were not explicitly trained for. The BASE TTS model, with 980 million parameters, showed a significant leap in ability, especially in handling tricky text and complex sentences.

The BASE TTS models are streamable, meaning they don't need to generate whole sentences at once but can produce audio moment by moment at a relatively low bitrate. The team also attempted to package speech metadata like emotionality and prosody in a separate, low-bandwidth stream. However, the model is still experimental and not a commercial product. The researchers have not published the model's source and other data due to the risk of misuse by bad actors.

Key takeaways

Researchers at Amazon have trained the largest-ever text-to-speech model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), which exhibits improved ability to speak complex sentences naturally.
The model was trained on 100,000 hours of public domain speech in multiple languages and has 980 million parameters, making it the largest in its category.
The BASE TTS model has shown emergent abilities, performing tasks it wasn't specifically trained for, such as handling compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, and syntactic complexity.
Despite its success, the team has declined to publish the model's source and other data due to the risk of misuse by bad actors.

Largest text-to-speech AI model yet shows 'emergent abilities' | TechCrunch
