Cerebras Inference for Llama 3.1 405B is available now for customer trials, with general availability expected in Q1 2025. Pricing is $6 per million input tokens and $12 per million output tokens, roughly 20% lower than AWS, Azure, and GCP. The combination of Meta’s open approach and Cerebras’s breakthrough inference technology means Llama 3.1 405B now runs more than 10 times faster than closed frontier models, making it ideal for voice, video, and reasoning applications where minimal latency and maximum reasoning steps are crucial.
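As a rough illustration of the pricing above, here is a minimal sketch of the per-request cost at these rates (the token counts in the example are hypothetical placeholders, not figures from this announcement):

```python
# Published rates for Cerebras Inference on Llama 3.1 405B:
# $6 per million input tokens, $12 per million output tokens.
INPUT_RATE = 6.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 12.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 2,000-token prompt with a 500-token completion (hypothetical sizes).
cost = request_cost(2_000, 500)
print(f"${cost:.4f}")  # 2000 * $0.000006 + 500 * $0.000012 = $0.0180
```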
Key takeaways:
- Frontier AI now runs at instant speed with Llama 3.1 405B on Cerebras, achieving a new record of 969 tokens/s, making it 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet.
- Cerebras Inference for Llama 3.1 405B offers the fastest time-to-first-token of any platform, at just 240 milliseconds, improving the user experience for real-time AI applications.
- The service is available for customer trials now, with general availability coming in Q1 2025. Pricing is $6 per million input tokens and $12 per million output tokens, which is 20% lower than AWS, Azure, and GCP.
- Llama 3.1 405B, thanks to Meta’s open approach and Cerebras’s breakthrough inference technology, now runs more than 10 times faster than closed frontier models, making it ideal for applications where minimal latency and maximum reasoning steps are crucial.
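To put the figures above in concrete terms, a short sketch estimating end-to-end response time from the reported 240 ms time-to-first-token and 969 tokens/s generation rate (the 500-token response length is a hypothetical example, and this simple model assumes steady-state throughput after the first token):

```python
TTFT_S = 0.240        # reported time to first token, in seconds
TOKENS_PER_S = 969.0  # reported generation throughput, tokens per second

def response_time(output_tokens: int) -> float:
    """Estimated seconds from request to final token:
    time-to-first-token plus steady-state generation time."""
    return TTFT_S + output_tokens / TOKENS_PER_S

# Example: a 500-token answer (hypothetical length).
t = response_time(500)
print(f"{t:.2f} s")  # 0.240 + 500/969 ≈ 0.76 s
```

At this throughput, even a fairly long answer completes in under a second, which is why the announcement emphasizes real-time voice and video use cases.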