The author then provides specific examples of running LLaMa on different devices, such as an A100, an M1 MacBook Air, an M2 Pro, an M2 Max, and a Raspberry Pi 4, detailing the expected performance on each. The article concludes by emphasizing the importance of reducing memory requirements for these models, suggesting methods such as distillation or training smaller models for longer. The author acknowledges potential errors in their calculations and invites feedback for improvement.
Key takeaways:
- The LLaMa inference code has been rewritten in raw C++, allowing it to run on a variety of hardware, including a Pixel 5, an M2 MacBook Pro, and a Raspberry Pi.
- Memory bandwidth is the limiting factor in sampling from transformers, and anything that reduces the memory requirements for these models, like quantization, makes them much easier to serve.
- On an A100 (80GB PCIe), the model is heavily memory-bound: roughly 30 tokens/s with the 65B model and 277 tokens/s with the 7B model.
- On a Raspberry Pi 4, observed performance is around 0.1 tokens/s with the 7B model, well below what the memory-bandwidth estimate predicts, suggesting it is compute-bound rather than memory-bound.
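The memory-bandwidth argument above can be sketched as a back-of-the-envelope calculation: when decoding one token at a time, every weight must be streamed from memory once per token, so memory bandwidth divided by model size gives an upper bound on tokens/s. This is a minimal sketch, not the article's exact code; the A100 bandwidth figure (~1935 GB/s) and the assumption of int8 weights (1 byte per parameter) are assumptions chosen to match the quoted numbers.

```python
def max_tokens_per_s(params_billions, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on autoregressive decoding speed (batch size 1).

    Assumes decoding is memory-bound: each generated token requires
    reading all model weights from memory exactly once.
    """
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# Assumed A100 80GB PCIe bandwidth: ~1935 GB/s; int8 weights (1 byte/param).
print(round(max_tokens_per_s(65, 1, 1935), 1))  # ~29.8 tokens/s for 65B
print(round(max_tokens_per_s(7, 1, 1935), 1))   # ~276.4 tokens/s for 7B
```

The same formula applied to a Raspberry Pi 4's much lower memory bandwidth predicts a higher rate than the ~0.1 tokens/s actually observed, which is what leads the author to suspect the Pi is compute-bound rather than memory-bound.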