The article also provides a quickstart guide for using the model: installing the pip dependencies, loading the checkpoints, picking a reference clip and its transcript, and running synthesis. The checkpoints for MARS5 are provided under the releases tab of the GitHub repo. The article also outlines a roadmap for improving the model's quality, stability, and performance, and invites contributions. The model is open-sourced under the GNU AGPL 3.0 license, but other licensing options can be requested.
Key takeaways:
- MARS5 is a novel English text-to-speech (TTS) model developed by CAMB.AI that can generate speech for diverse scenarios from just 5 seconds of reference audio and a snippet of text.
- The model follows a two-stage AR-NAR pipeline, and punctuation and capitalization in the input text provide a natural way to guide the prosody of the generated output.
- It supports two kinds of inference: a fast, shallow clone and a slower but typically higher-quality mode called a 'deep clone'.
- The model requires at least 20GB of GPU VRAM; those without the necessary hardware can use it via the CAMB.AI API instead.
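The quickstart steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the `torch.hub.load` entry point (`'Camb-ai/mars5-tts'`, `'mars5_english'`), the `InferenceConfig` options, and the `mars5.tts(...)` call are assumed from the MARS5 GitHub repo's documentation, and the file paths and strings are placeholders. Running it for real requires a GPU with sufficient VRAM plus `torch` and `librosa` installed, so the heavy imports are deferred into `main()`.

```python
# Placeholder inputs -- swap in your own text and reference clip.
TEXT = "Hello, this is a test of the MARS5 speech model."
REF_AUDIO = "reference.wav"          # ~5 second reference clip (placeholder path)
REF_TRANSCRIPT = "This is what the reference clip says."  # needed for a deep clone

def main():
    # Heavy dependencies are imported lazily so this sketch stays
    # importable on machines without torch/librosa installed.
    import torch
    import librosa

    # Load the model and its config class via torch.hub (repo and entry-point
    # names assumed from the MARS5 GitHub README).
    mars5, InferenceConfig = torch.hub.load(
        'Camb-ai/mars5-tts', 'mars5_english', trust_repo=True
    )

    # Load the reference clip at the model's expected sample rate.
    wav, _ = librosa.load(REF_AUDIO, sr=mars5.sr, mono=True)
    wav = torch.from_numpy(wav)

    # deep_clone=True selects the slower, typically higher-quality path;
    # False gives the fast, shallow clone (assumed option names).
    cfg = InferenceConfig(deep_clone=True, temperature=0.7)

    # Synthesize: returns the AR codebook tokens and the output waveform.
    ar_codes, audio = mars5.tts(TEXT, wav, REF_TRANSCRIPT, cfg=cfg)
    return audio

if __name__ == "__main__":
    main()
```

Note that the shallow clone only needs the reference audio, while the deep clone also consumes the reference transcript, which is why the quickstart asks you to pick both a reference and its transcript.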