
Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference - Cerebras

Nov 19, 2024 - cerebras.ai
Frontier AI has reached a new speed record: Meta's Llama 3.1 405B model now runs on Cerebras Inference at 969 tokens/s, 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet. The model also achieved the highest measured performance at 128K context length and the shortest time-to-first-token latency. Cerebras Inference runs the model at instant speed, generating 969 output tokens/s with a 1,000-token prompt, which is 8x faster than SambaNova, 12x faster than the fastest GPU cloud, and 75x faster than AWS.

Cerebras Inference for Llama 3.1-405B is available for customer trials, with general availability expected in Q1 2025. The pricing is $6 per million input tokens and $12 per million output tokens, which is 20% lower than AWS, Azure, and GCP. The combination of Meta’s open approach and Cerebras’s breakthrough inference technology has made Llama 3.1-405B more than 10 times faster than closed frontier models, making it ideal for voice, video, and reasoning applications where minimal latency and maximum reasoning steps are crucial.
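As a worked example of the published rates, here is a minimal cost-estimation sketch in Python. The rates are taken from the article ($6 per million input tokens, $12 per million output tokens); the request sizes are made-up illustration values, not figures from the announcement.

```python
# Cost estimate for Cerebras Inference on Llama 3.1-405B, using the
# published rates: $6 / 1M input tokens and $12 / 1M output tokens.
INPUT_RATE_PER_M = 6.00    # USD per million input tokens
OUTPUT_RATE_PER_M = 12.00  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: a 1,000-token prompt that generates 2,000 output tokens.
print(f"${request_cost(1_000, 2_000):.4f}")  # -> $0.0300
```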

Key takeaways:

  • Frontier AI now runs at instant speed with Llama 3.1 405B on Cerebras, achieving a new record of 969 tokens/s, making it 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet.
  • Cerebras Inference for Llama 3.1-405B offers the fastest time-to-first-token of any platform, with a latency of just 240 milliseconds, improving the user experience for real-time AI applications (a rough measurement sketch follows this list).
  • The service is available for customer trials now, with general availability coming in Q1 2025. Pricing is $6 per million input tokens and $12 per million output tokens, which is 20% lower than AWS, Azure, and GCP.
  • Llama 3.1-405B, thanks to Meta’s open approach and Cerebras’s breakthrough inference technology, now runs more than 10 times faster than closed frontier models, making it ideal for applications where minimal latency and maximum reasoning steps are crucial.
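To make the latency and throughput claims concrete, here is a hedged sketch of how one might measure time-to-first-token and output tokens/s against a streaming chat-completions endpoint. The article does not describe the API shape; the OpenAI-compatible interface, base URL, model identifier, and API-key environment variable below are all illustrative assumptions, and the word-split token count is only a rough proxy.

```python
# Sketch: measure time-to-first-token (TTFT) and approximate output tokens/s.
# base_url, model name, and env var are assumptions, not documented values.
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",      # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],     # assumed env var
)

start = time.perf_counter()
first_token_at = None
approx_tokens = 0

stream = client.chat.completions.create(
    model="llama-3.1-405b",                     # assumed model identifier
    messages=[{"role": "user", "content": "Summarize the rules of chess."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()
    approx_tokens += len(delta.split())         # rough proxy for token count

decode_time = time.perf_counter() - (first_token_at or start)
print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"approx. output tokens/s: {approx_tokens / decode_time:.0f}")
```

Throughput here is computed over the decode phase only (after the first token arrives), which is the usual way per-token generation speed is reported.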