PlayHT2.0 is a significant improvement over its predecessor, PlayHT1.0, which had limitations including poor zero-shot capabilities, short speech generations, no control over speech styles or emotions, and English-only support. The new model is more robust and reduces latency to conversational, real-time levels, generating speech in less than 800ms. It is now available in alpha through the PlayHT Studio and API, with major updates expected in the coming weeks.
Key takeaways:
- PlayHT has introduced PlayHT2.0, a new generative text-to-voice AI model that can generate conversational speech and brings the concept of emotions to generative voice AI.
- The new model has improved capabilities, including real-time speech generation, instant voice cloning, cross-language and accent cloning, and emotion direction.
- PlayHT2.0 was trained on a dataset of more than 1 million hours of speech across multiple languages, accents, and speaking styles, and can generate speech in less than 800ms.
- The model is currently available in alpha through PlayHT's Studio and API, with major updates expected in the coming weeks to further improve its quality, speed, and capabilities.