The author uses Python code snippets to illustrate the implementation of each component. The implementation is demonstrated with the stories15M model weights created by Andrej Karpathy, and the author shows how to run the model, noting that the full source code is available on GitHub. The article concludes that the model performs reasonably well, generating about 33 tokens per second on an M2 MacBook Air.
Key takeaways:
- The Llama 3 model, unveiled by Meta, is creating a buzz due to its scale and performance: roughly 24K GPUs, 15T tokens of training data, 10M instruction-tuning examples, and 1.3M GPU hours of training.
- The model structure of Llama 3 has not changed significantly from Llama 2; the main difference is the adoption of GQA (Grouped-Query Attention), sketched in the first code example after this list.
- The Llama 3 model uses a position encoding technique called RoPE (Rotary Position Embedding), which combines characteristics of both absolute and relative position encoding (see the second sketch below).
- The model also uses a KV Cache (Key-Value cache), which stores the K and V tensors from earlier steps of the attention mechanism so they are not recomputed for every new token, improving inference efficiency (see the third sketch below).
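
To make the GQA point concrete, here is a minimal sketch (not the article's code; the function name `grouped_query_attention` and the tensor layout are assumptions for illustration). The idea is that several query heads share a single K/V head, so K and V are replicated to line up with the queries before standard attention:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # Hypothetical helper, not from the article.
    # q: (batch, seq, n_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    n_heads, n_kv_heads = q.shape[2], k.shape[2]
    n_rep = n_heads // n_kv_heads                     # query heads per shared KV head
    k = k.repeat_interleave(n_rep, dim=2)             # replicate each KV head across its group
    v = v.repeat_interleave(n_rep, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    # (causal mask omitted to keep the sketch short)
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2)

# Example: 32 query heads attend while only 8 KV heads are stored.
q = torch.randn(1, 8, 32, 64)
k = v = torch.randn(1, 8, 8, 64)
out = grouped_query_attention(q, k, v)  # shape (1, 8, 32, 64)
```

The benefit is mostly in memory: only `n_kv_heads` K/V projections have to be computed and cached, while attention quality stays close to full multi-head attention.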
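For RoPE, the following is a compact illustration under common assumptions (an even `head_dim` and the usual base of 10000); the function name `apply_rope` is hypothetical and the article's implementation may differ in detail. Each pair of channels is treated as a complex number and rotated by an angle proportional to the token's position, so absolute position enters directly while dot products between rotated vectors depend only on relative position:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Hypothetical helper, not from the article.
    # x: (batch, seq_len, n_heads, head_dim), head_dim assumed even
    bsz, seq_len, n_heads, head_dim = x.shape
    # one rotation frequency per pair of channels
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, head_dim/2)
    rotation = torch.polar(torch.ones_like(angles), angles)[None, :, None, :]
    # view channel pairs as complex numbers and rotate them by the position-dependent angles
    x_complex = torch.view_as_complex(x.float().reshape(bsz, seq_len, n_heads, -1, 2))
    return torch.view_as_real(x_complex * rotation).reshape(bsz, seq_len, n_heads, head_dim)
```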
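Finally, a simplified KV cache sketch, assuming pre-allocated buffers sized to a fixed maximum sequence length; the class name and `update` method are illustrative rather than taken from the article. During generation, each step writes only the new token's K and V into the buffers and attends over everything cached so far:

```python
import torch

class KVCache:
    """Illustrative cache that keeps K and V from earlier decoding steps."""

    def __init__(self, max_seq_len, n_kv_heads, head_dim, batch_size=1):
        self.cache_k = torch.zeros(batch_size, max_seq_len, n_kv_heads, head_dim)
        self.cache_v = torch.zeros(batch_size, max_seq_len, n_kv_heads, head_dim)

    def update(self, start_pos, k, v):
        # k, v: (batch, new_tokens, n_kv_heads, head_dim) for the tokens just processed
        seq_len = k.shape[1]
        self.cache_k[:, start_pos:start_pos + seq_len] = k
        self.cache_v[:, start_pos:start_pos + seq_len] = v
        # return everything cached so far; attention only needs queries for the new tokens
        return (self.cache_k[:, :start_pos + seq_len],
                self.cache_v[:, :start_pos + seq_len])
```

This is what makes autoregressive decoding roughly linear per token instead of recomputing attention over the whole prefix at every step, and it combines naturally with GQA since only the smaller number of KV heads has to be stored.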