The company's innovations have reduced serving costs by a factor of 33 since late 2022, making serving 13.5 times cheaper than it would be using leading commercial APIs. Character.AI is committed to further optimizing LLM inference to drive innovation and enhance user experiences worldwide. They envision a future where efficient and scalable AI systems are integral to every interaction.
Key takeaways:
- Character.AI is working on optimizing the inference process of large language models (LLMs) to make them more efficient, cost-effective, and scalable.
- They have developed techniques that reduce the size of the attention key-value (KV) cache by more than 20X; KV cache size is a key bottleneck for LLM inference throughput.
- One of their key innovations is an efficient system for caching attention KV on host memory between chat turns; it achieves a 95% cache rate and further reduces inference cost.
- They use int8 quantization on model weights, activations, and attention KV cache, and natively train their models in int8 precision, which improves training efficiency and eliminates the risk of training/serving mismatch.
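The KV-cache bottleneck mentioned in the takeaways can be made concrete with a back-of-envelope size calculation. The sketch below is illustrative only: the model shape and the specific reduction techniques shown (multi-query attention plus 8-bit KV entries) are assumptions chosen for the example, not Character.AI's disclosed configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Size of the attention KV cache: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Baseline: multi-head attention with fp16 KV (hypothetical model shape).
base = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                      seq_len=4096, batch=1, bytes_per_elem=2)

# Reduced: multi-query attention (one shared KV head) plus int8 KV entries.
reduced = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128,
                         seq_len=4096, batch=1, bytes_per_elem=1)

print(base // reduced)  # combined reduction factor → 64
```

The point of the arithmetic is that per-layer and per-head savings multiply, which is how reductions beyond 20X become reachable.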
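The inter-turn KV caching described above amounts to a prefix cache: if a new request's prompt starts with a previously seen conversation prefix, the KV tensors for that prefix can be reused instead of recomputed, so only the new suffix needs prefill. A minimal sketch, with a hash-keyed host-memory dict standing in for the real system (all names here are hypothetical):

```python
import hashlib
from collections import OrderedDict

class HostKVCache:
    """Toy host-memory KV cache keyed by a hash of the token prefix (LRU eviction)."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()  # prefix hash -> KV tensors for that prefix

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def lookup(self, tokens):
        """Return (prefix, KV) for the longest cached prefix of `tokens`, or (None, None)."""
        for end in range(len(tokens), 0, -1):
            key = self._key(tokens[:end])
            if key in self.entries:
                self.entries.move_to_end(key)   # refresh LRU position
                return tokens[:end], self.entries[key]
        return None, None

    def store(self, tokens, kv):
        self.entries[self._key(tokens)] = kv
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least-recently used entry

# Turn 1: compute KV for the whole prompt and cache it.
cache = HostKVCache()
turn1 = [101, 7, 8, 9]
cache.store(turn1, kv="<kv-for-turn-1>")

# Turn 2: the new prompt extends turn 1, so only the suffix needs prefill.
turn2 = turn1 + [10, 11]
prefix, kv = cache.lookup(turn2)
print(len(turn2) - len(prefix))  # tokens that still need prefill → 2
```

With chat workloads, consecutive turns share almost the entire prompt, which is what makes a high cache rate plausible.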
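The int8 scheme in the last takeaway can be illustrated with a symmetric per-tensor quantizer. The scale choice and rounding below are a generic sketch of int8 quantization, not Character.AI's actual recipe:

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.5, 0.001]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step, i.e. scale / 2.
assert all(abs(a - w) <= scale / 2 for a, w in zip(approx, weights))
```

Training natively in int8, as the takeaway notes, means the same quantized representation is used in both training and serving, which is what removes the train/serve mismatch that post-training quantization can introduce.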