GitHub - fixie-ai/ultravox

Jun 09, 2024 - github.com
Ultravox is a multimodal large language model (LLM) that can understand both text and human speech without the need for a separate automatic speech recognition (ASR) stage. It builds on research from AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, extending Meta's Llama 3 model with a multimodal projector that converts audio directly into the high-dimensional embedding space used by Llama 3. The current version of Ultravox has a time-to-first-token (TTFT) of approximately 200 ms and a tokens-per-second rate of around 100, using a Llama 3 8B backbone.
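To make the projector idea concrete, here is a minimal sketch of how audio features from a speech encoder can be mapped into an LLM's embedding space. The dimensions, layer choices, and class name are illustrative assumptions, not the actual Ultravox implementation.

```python
# Hypothetical sketch of a speech-to-LLM projector in the spirit of
# Ultravox's design. All sizes and layer choices here are assumptions.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, stack: int = 8):
        super().__init__()
        self.stack = stack  # concatenate adjacent frames to shorten the sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from a frozen speech encoder
        b, t, d = audio_feats.shape
        t = t - t % self.stack  # drop trailing frames so stacking divides evenly
        x = audio_feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)  # (batch, t // stack, llm_dim) pseudo-token embeddings
```

The projected embeddings stand in for a span of ordinary text tokens in the Llama 3 input, so the backbone attends to speech directly rather than to an ASR transcript.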

Ultravox can be tested using your own audio content, and the latest weights can be downloaded from the Ultravox Hugging Face page. The team behind Ultravox is interested in working with other parties to further develop the model, and they are currently hiring. They also provide a guide for those interested in training their own version of Ultravox, including instructions for setting up the environment and running evaluations.
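For loading the published weights, the sketch below assumes the checkpoint follows common Hugging Face conventions (a transformers pipeline with custom code shipped alongside the weights). The model ID, input format, and sampling rate are assumptions; check the Ultravox Hugging Face page for the exact interface.

```python
# Sketch of loading Ultravox weights from Hugging Face. Model ID and
# input format are assumptions and may differ from the actual checkpoint.
import transformers
import librosa

# trust_remote_code is typically required because multimodal checkpoints
# ship custom model code alongside the weights.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox",  # hypothetical ID; see the HF page for the real one
    trust_remote_code=True,
)

audio, sr = librosa.load("question.wav", sr=16000)
turns = [{"role": "system", "content": "You are a helpful voice assistant."}]
result = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sr},
    max_new_tokens=64,
)
print(result)
```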

Key takeaways:

  • Ultravox is a new kind of multimodal LLM that understands text as well as human speech, without the need for a separate automatic speech recognition (ASR) stage.
  • The current version of Ultravox (v0.1) has a time-to-first-token (TTFT) of approximately 200ms, and a tokens-per-second rate of ~100, all using a Llama 3 8B backbone.
  • Ultravox can be tested using your own audio content (as a WAV file) via a curl command provided in the article (see the sketch after this list).
  • Ultravox is open for contributions and the article provides detailed instructions on how to set up the environment and train your own version of Ultravox.
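The repository documents the exact curl command for sending a WAV file to a running inference server; the sketch below expresses the same idea in Python. The endpoint URL and form field names are placeholders, not the actual API.

```python
# Sketch of posting a WAV file to a running Ultravox inference endpoint,
# mirroring the curl command mentioned above. The URL and field names
# are placeholders; consult the repository for the real invocation.
import requests

ENDPOINT = "http://localhost:8000/infer"  # hypothetical local server address

with open("my_audio.wav", "rb") as f:
    resp = requests.post(ENDPOINT, files={"audio": ("my_audio.wav", f, "audio/wav")})

resp.raise_for_status()
print(resp.text)  # the model's text response to the spoken input
```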