Ultravox can be tested with your own audio content, and the latest weights can be downloaded from the Ultravox Hugging Face page. The team behind Ultravox is hiring and is interested in working with other parties to develop the model further. They also provide a guide for training your own version of Ultravox, including instructions for setting up the environment and running evaluations.
Key takeaways:
- Ultravox is a new kind of multimodal LLM that understands both text and human speech, without the need for a separate automatic speech recognition (ASR) stage.
- The current version of Ultravox (v0.1) has a time-to-first-token (TTFT) of approximately 200 ms and generates roughly 100 tokens per second, using a Llama 3 8B backbone.
- Ultravox can be tested using your own audio content (as a WAV file) via a curl command provided in the article.
- Ultravox is open for contributions and the article provides detailed instructions on how to set up the environment and train your own version of Ultravox.
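To put the latency figures above in perspective, here is a rough back-of-the-envelope sketch of how long a spoken-query response might take to generate. It assumes the first token arrives after the quoted TTFT and the remaining tokens stream at a steady ~100 tokens per second; it ignores network, audio upload, and speech synthesis time. The function name and constants are illustrative and are not part of Ultravox.

```python
# Figures quoted in the takeaways for Ultravox v0.1 (approximate).
TTFT_S = 0.200          # time to first token, in seconds
TOKENS_PER_S = 100.0    # steady-state generation rate


def estimated_response_time(num_tokens: int,
                            ttft_s: float = TTFT_S,
                            tokens_per_s: float = TOKENS_PER_S) -> float:
    """Estimate seconds needed to generate `num_tokens` tokens.

    Assumption: the first token arrives after `ttft_s`, and every
    later token arrives at a constant rate of `tokens_per_s`.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    return ttft_s + (num_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    for n in (1, 21, 51):
        print(f"{n:>3} tokens: ~{estimated_response_time(n):.2f} s")
```

Under these assumptions a response of about 50 tokens would finish in roughly 0.7 seconds, with the first token available after about 0.2 seconds. Real-world numbers will depend on hardware, load, and input length.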