
Optimizing AI Inference at Character.AI

Jun 20, 2024 - research.character.ai
Character.AI, a full-stack AI company, is working toward Artificial General Intelligence (AGI) with a focus on optimizing large language model (LLM) inference for global use. They have developed a memory-efficient architecture and a stateful caching system that make serving LLMs more efficient, cost-effective, and scalable. Their system handles over 20,000 inference queries per second, roughly 20% of the request volume served by Google Search. They also apply int8 quantization to model weights, activations, and the attention KV cache, and train their models natively in int8 precision, which improves training efficiency.
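To make the int8 idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization with NumPy. This is a generic illustration of the technique, not Character.AI's implementation; the function names and the per-tensor scaling scheme are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map floats onto the int8 range [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0          # one scale for the whole tensor (assumed scheme)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from int8 codes and the stored scale."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# int8 storage is 4x smaller than float32; rounding error is at most scale/2 per element
assert np.max(np.abs(weights - recovered)) <= scale / 2 + 1e-6
```

The memory payoff is what matters for inference: each weight, activation, or KV entry shrinks from 4 bytes (float32) to 1 byte, at the cost of bounded rounding error.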

The company's innovations have reduced serving costs by a factor of 33 since late 2022, making it 13.5 times cheaper than using leading commercial APIs. Character.AI is committed to further optimizing LLMs to drive innovation and enhance experiences worldwide. They envision a future where efficient and scalable AI systems are integral to every interaction.

Key takeaways:

  • Character.AI is working on optimizing the inference process of large language models (LLMs) to make them more efficient, cost-effective, and scalable.
  • They have developed techniques that shrink the cache of attention keys and values (KV), a key bottleneck for LLM inference throughput, by more than 20X.
  • One of their key innovations is an efficient system for caching attention KV on host memory between chat turns, achieving a 95% cache rate and further reducing inference cost.
  • They use int8 quantization on model weights, activations, and attention KV cache, and natively train their models in int8 precision, which improves training efficiency and eliminates the risk of training/serving mismatch.
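The inter-turn KV caching idea above can be sketched as a host-memory store keyed by the conversation prefix: if a new turn extends a conversation whose KV was already computed, the cached tensors are reused instead of recomputed. This is a toy illustration under assumed details (hash-of-prefix keys, a plain dict as the store); the class and method names are hypothetical, not Character.AI's API.

```python
import hashlib

class HostKVCache:
    """Toy host-memory cache of attention KV between chat turns (assumed scheme)."""

    def __init__(self):
        self._store = {}   # prefix hash -> KV tensors (a placeholder object here)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prefix_tokens) -> str:
        # Key on a hash of the token prefix so identical prefixes map to the same entry.
        return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

    def get(self, prefix_tokens):
        kv = self._store.get(self._key(prefix_tokens))
        if kv is None:
            self.misses += 1
        else:
            self.hits += 1
        return kv

    def put(self, prefix_tokens, kv):
        self._store[self._key(prefix_tokens)] = kv

cache = HostKVCache()
turn1 = [1, 2, 3]            # token IDs of the first chat turn
kv1 = "kv-for-turn1"         # stand-in for real KV tensors
if cache.get(turn1) is None: # first turn: cache miss, so prefill and store
    cache.put(turn1, kv1)

# The next turn extends the same conversation, so the shared prefix hits the cache.
assert cache.get(turn1) == kv1
assert (cache.hits, cache.misses) == (1, 1)
```

In a real serving system the cached values would be the per-layer KV tensors moved between host and GPU memory, and eviction policy would matter; the point here is only the keyed reuse that makes a high cache rate possible.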
