The article also provides a quickstart guide for using the model: installing the pip dependencies, loading the checkpoints, picking a reference clip and its transcript, and running synthesis. The checkpoints for MARS5 are provided under the releases tab of the GitHub repo. The article also outlines a roadmap for improving the model's quality, stability, and performance, and invites contributions. The model is open-sourced under the GNU AGPL 3.0 license, but other licensing options can be requested.
Key takeaways:
- MARS5 is a novel English text-to-speech (TTS) model developed by CAMB.AI that can generate speech for diverse scenarios from just 5 seconds of reference audio and a snippet of text.
- The model follows a two-stage AR-NAR pipeline, and punctuation and capitalization in the input text provide a natural way to guide the prosody of the generated output.
- It supports two kinds of inference: a fast, shallow clone and a slower but typically higher-quality mode called a 'deep clone'.
- The model requires at least 20GB of GPU VRAM; those without the necessary hardware can use it via the CAMB.AI API instead.
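The quickstart steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the `torch.hub.load` entry point (`'Camb-ai/mars5-tts'`, `'mars5_english'`), the `InferenceConfig` options, and the `mars5.tts(...)` call are assumed from the MARS5 GitHub repo's documentation, and the file paths and strings are placeholders. Running it for real requires a GPU with sufficient VRAM plus `torch` and `librosa` installed, so the heavy imports are deferred into `main()`.

```python
# Placeholder inputs -- swap in your own text and reference clip.
TEXT = "Hello, this is a test of the MARS5 speech model."
REF_AUDIO = "reference.wav"          # ~5 second reference clip (placeholder path)
REF_TRANSCRIPT = "This is what the reference clip says."  # needed for a deep clone

def main():
    # Heavy dependencies are imported lazily so this sketch stays
    # importable on machines without torch/librosa installed.
    import torch
    import librosa

    # Load the model and its config class via torch.hub (repo and entry-point
    # names assumed from the MARS5 GitHub README).
    mars5, InferenceConfig = torch.hub.load(
        'Camb-ai/mars5-tts', 'mars5_english', trust_repo=True
    )

    # Load the reference clip at the model's expected sample rate.
    wav, _ = librosa.load(REF_AUDIO, sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)

    # deep_clone=True selects the slower, typically higher-quality path;
    # False gives the fast, shallow clone (assumed option names).
    cfg = InferenceConfig(deep_clone=True, temperature=0.7)

    # Synthesize: returns the AR codebook tokens and the output waveform.
    ar_codes, audio = mars5.tts(TEXT, wav, REF_TRANSCRIPT, cfg=cfg)
    return audio

if __name__ == "__main__":
    main()
```

Note that the shallow clone only needs the reference audio, while the deep clone also consumes the reference transcript, which is why the quickstart asks you to pick both a reference and its transcript.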