Scribe currently works only with pre-recorded audio; a low-latency real-time version is expected soon. It is priced at $0.40 per hour of transcribed audio, though some competitors offer lower rates. ElevenLabs aims to improve its speech detection models by using in-house data annotation teams for quick feedback. The company also provides tools that let customers transcribe video content for subtitles or captions. CEO Mati Staniszewski emphasized the need for better speech detection models, particularly for languages where current solutions are inadequate.
Key takeaways:
- ElevenLabs launched its first standalone speech-to-text model, Scribe, which supports over 99 languages, more than 25 of them with excellent accuracy.
- Scribe outperformed Google Gemini 2.0 Flash and Whisper Large V3 on the FLEURS and Common Voice benchmarks.
- Scribe includes features like smart speaker diarization, word-level timestamps, and auto-tagging of sound events, but currently only works with pre-recorded audio formats.
- ElevenLabs plans to release a low-latency real-time version of Scribe soon. Current pricing is $0.40 per hour of transcribed audio.
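
To make the quoted rate concrete, here is a minimal sketch of the arithmetic in Python. It assumes cost scales linearly with audio duration, with no minimum charge, rounding, or volume discount; the article states none of these billing details. The function name `transcription_cost` is hypothetical and is not part of any ElevenLabs API.

```python
RATE_USD_PER_HOUR = 0.40  # rate quoted in the article


def transcription_cost(duration_seconds: float,
                       rate_per_hour: float = RATE_USD_PER_HOUR) -> float:
    """Estimate the cost in USD of transcribing audio of the given length.

    Assumption: billing is strictly pro rata by duration. Actual billing
    rules (minimums, rounding, tiers) are not described in the article.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    hours = duration_seconds / 3600
    return hours * rate_per_hour


# A 90-minute recording is 1.5 hours of audio.
print(f"${transcription_cost(90 * 60):.2f}")
```

Under these assumptions, the cost works out to well under one cent per minute of audio.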