
Introducing Cerebras Inference: AI at Instant Speed - Cerebras

Aug 28, 2024 - news.bensbites.com
Cerebras Inference is designed to serve models with billions to trillions of parameters; when a model exceeds the memory capacity of a single wafer, it is split at layer boundaries and mapped across multiple CS-3 systems. The company plans to add larger models such as Llama3-405B and Mistral Large, and serves models with their original 16-bit weights, which score up to 5% higher on benchmarks than 8-bit quantized versions. The Cerebras Inference API is available for developers to integrate, combining high throughput, accuracy, and cost-effectiveness.
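The layer-boundary split is essentially pipeline parallelism. As a hedged illustration (not Cerebras's actual implementation), the sketch below greedily partitions a model's layers into contiguous groups, one per system, sized by per-device memory; all sizes are assumptions for the example:

```python
# Hypothetical sketch of layer-boundary model partitioning (pipeline
# parallelism). Names and sizes are illustrative, not Cerebras internals.

def partition_layers(layer_bytes: list[int], device_capacity: int) -> list[list[int]]:
    """Greedily group contiguous layers so each group fits on one device."""
    groups, current, used = [], [], 0
    for i, size in enumerate(layer_bytes):
        if size > device_capacity:
            raise ValueError(f"layer {i} alone exceeds device capacity")
        if used + size > device_capacity:  # current device is full; start a new one
            groups.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        groups.append(current)
    return groups

# Example: a 405B-parameter model at 16 bits (~2 bytes/param) is roughly
# 810 GB of weights. Assuming ~6.5 GB per layer across 126 layers and a
# hypothetical 44 GB of usable on-wafer memory per system, the layers must
# be spread across many systems.
layers = [6_500_000_000] * 126
print(len(partition_layers(layers, 44_000_000_000)), "systems needed")
```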

The company emphasizes the importance of fast inference, which enables more complex AI workflows and enhances real-time LLM intelligence. Techniques like scaffolding, which can require up to 100x more tokens at runtime, are practical in real time only on hardware this fast. With its high-speed inference, Cerebras sets a new standard for open LLM development and deployment.
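Scaffolding here means wrapping each user-visible answer in extra model calls, for instance drafting several candidates and asking the model to pick the best one, so the runtime token budget multiplies. A minimal best-of-N sketch, where `complete()` is a hypothetical placeholder for any LLM call rather than a real SDK function:

```python
# Minimal best-of-N scaffolding sketch. `complete` stands in for any LLM
# call (e.g. an inference API); it is a placeholder, not a real SDK function.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to your inference endpoint")

def scaffolded_answer(task: str, n_drafts: int = 8) -> str:
    # Each draft is a full generation, so token usage scales roughly with
    # n_drafts (plus the ranking pass) -- which is why scaffolding multiplies
    # runtime tokens and why per-token speed dominates wall-clock latency.
    drafts = [complete(f"Solve step by step:\n{task}") for _ in range(n_drafts)]
    numbered = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    choice = complete(
        f"Task:\n{task}\n\nCandidate solutions:\n{numbered}\n\n"
        "Reply with only the index of the best candidate."
    )
    return drafts[int(choice.strip().strip("[]"))]
```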

Key takeaways:

  • Cerebras Inference is designed to serve models with billions to trillions of parameters, with larger models such as Llama3-405B and Mistral Large coming soon.
  • The system uses the original 16-bit weights released by Meta for the Llama3.1 8B and 70B models, ensuring the most accurate and reliable model output.
  • Cerebras Inference is available today via chat and API access (see the sketch after this list), offering the best combination of performance, speed, accuracy, and cost.
  • Fast inference enables more complex AI workflows and real-time LLM intelligence, with new techniques like scaffolding delivering over 10x better performance on demanding tasks like code generation.
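For the API access mentioned above, the sketch below assumes an OpenAI-compatible chat completions endpoint; the base URL, model id, and environment variable are assumptions for illustration, so consult the official Cerebras documentation for the real values:

```python
# Hedged sketch of calling an OpenAI-compatible chat completions endpoint.
# The URL, model id, and env var below are assumed, not confirmed by the
# article; check the Cerebras docs before use.
import os
import requests

API_URL = "https://api.cerebras.ai/v1/chat/completions"  # assumed endpoint

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"},
    json={
        "model": "llama3.1-8b",  # illustrative model id
        "messages": [
            {"role": "user", "content": "Why does inference speed matter?"}
        ],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```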