Huang also discussed the importance of positional encodings, which allow the model to understand the relative position of tokens in the input sequence. The team used RoPE (Rotary Position Embedding), which encodes positions with rotation matrices and generalizes better to longer sequences. Gradient's main innovation was to focus on tuning the theta hyperparameter that governs the rotation frequencies of the encoding. This allowed them to scale Llama3 up to 1 million tokens and potentially beyond.
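To make the role of theta concrete, here is a minimal NumPy sketch of RoPE in which theta is the base of the frequency progression: raising it lowers the rotation frequencies, so positional phases stretch over longer spans. The specific theta values in the example are illustrative only and are not Gradient's actual settings (Llama3's stock base is 500,000; context-extended variants use a much larger base).

```python
import numpy as np

def rope_frequencies(head_dim: int, theta: float = 10_000.0) -> np.ndarray:
    """Per-pair rotation frequencies used by RoPE.

    theta is the base of the geometric progression; a larger theta yields
    lower frequencies, so rotations repeat over longer position ranges.
    """
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

def apply_rope(x: np.ndarray, positions: np.ndarray, theta: float = 10_000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) feature pairs of x by position-dependent angles.

    x: (seq_len, head_dim) query or key vectors; positions: (seq_len,) token indices.
    """
    freqs = rope_frequencies(x.shape[-1], theta)        # (head_dim / 2,)
    angles = positions[:, None] * freqs[None, :]        # (seq_len, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Illustrative comparison (hypothetical values): the larger base stretches the
# lowest rotation frequency across far more positions before it wraps around.
x = np.random.randn(8, 128)
stock = apply_rope(x, np.arange(8), theta=500_000.0)
extended = apply_rope(x, np.arange(8), theta=8_000_000.0)
```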
Key takeaways:
- Gradient, a full-stack AI platform, is working on extending the context window of existing open-source models, particularly focusing on Llama3.
- The team at Gradient is pushing the limits of long context learning, aiming to adapt models to handle out-of-domain data and improve over time.
- They have successfully scaled the Llama3 model up to 1 million tokens by carefully increasing the theta hyperparameter that governs the rotation frequencies (see the RoPE sketch above).
- Gradient also uses Ring Attention to improve GPU utilization, and curriculum learning to progressively increase the sequence length over the course of training (sketched after this list), which has shown promising results in long-range reasoning.
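Curriculum learning over sequence length can be expressed as a simple stage schedule that trains at a short context first and steps up to longer ones. The sketch below is a hypothetical illustration of that idea; the stage lengths and step counts are invented for the example and are not Gradient's actual training recipe.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    seq_len: int  # training sequence length for this stage
    steps: int    # optimizer steps to spend at this length

# Hypothetical curriculum: lengths and step counts are illustrative only.
# Each stage extends the context substantially so the model adapts gradually.
CURRICULUM = [
    Stage(seq_len=65_536, steps=400),
    Stage(seq_len=262_144, steps=300),
    Stage(seq_len=1_048_576, steps=200),
]

def seq_len_for_step(step: int, curriculum: list[Stage]) -> int:
    """Return the sequence length to train on at a given global step."""
    for stage in curriculum:
        if step < stage.steps:
            return stage.seq_len
        step -= stage.steps
    return curriculum[-1].seq_len  # stay at the longest length once the schedule ends

if __name__ == "__main__":
    for step in (0, 450, 800):
        print(step, seq_len_for_step(step, CURRICULUM))
```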