ElevenLabs is launching its own speech-to-text model

ElevenLabs, an AI startup valued at $3.3 billion, has launched Scribe, its first standalone speech-to-text model, which supports over 99 languages. It competes with existing offerings from Gladia, Speechmatics, AssemblyAI, and Deepgram, as well as OpenAI’s Whisper models. ElevenLabs places over 25 of those languages, including English, French, and Spanish, in its "excellent" accuracy category, meaning a word error rate below 5%. In FLEURS and Common Voice benchmark tests, Scribe outperformed Google Gemini 2.0 Flash and Whisper Large V3. Its features include smart speaker diarization, word-level timestamps, and auto-tagging of sound events.
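For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not ElevenLabs' evaluation code, and the function name and the lowercase, whitespace-based normalization are assumptions made for the example; real benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: lowercasing and whitespace splitting are assumed
    normalization choices, not those of any particular benchmark.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four gives a WER of 0.25 (25%).
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a word error rate below 5% means fewer than one word-level error for every 20 words in the reference transcript.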

Scribe currently works only with pre-recorded audio, with a low-latency real-time version expected soon. It is priced at $0.40 per hour of transcribed audio, though some competitors offer lower rates. ElevenLabs aims to improve speech detection models by using in-house data annotation teams for quick feedback. The company is also providing tools that let customers transcribe video content for subtitles or captions. CEO Mati Staniszewski emphasized the need for better speech detection models, particularly for languages where current solutions are inadequate.

Key takeaways

ElevenLabs launched its first standalone speech-to-text model called Scribe, supporting over 99 languages with excellent accuracy in over 25 languages.
The Scribe model outperformed Google Gemini 2.0 Flash and Whisper Large V3 in FLEURS & Common Voice benchmark tests.
Scribe includes features like smart speaker diarization, word-level timestamps, and auto-tagging of sound events, but currently only works with pre-recorded audio formats.
ElevenLabs plans to release a low-latency real-time version of Scribe soon, with current pricing set at $0.40 per hour of transcribed audio.

ElevenLabs is launching its own speech-to-text model | TechCrunch
