Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Apr 16, 2024 - news.bensbites.com
The article introduces Megalodon, a new neural architecture designed for efficient sequence modeling with unlimited context length. This model addresses the limitations of Transformers, which struggle to scale to long sequences due to their quadratic complexity and weak length extrapolation. Megalodon builds on the architecture of Mega, incorporating several technical components to enhance its capability and stability, such as complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism, and pre-norm with two-hop residual configuration.
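The CEMA component is the piece that generalizes Mega's multi-dimensional damped exponential moving average into the complex domain, so the recurrent decay carries both a magnitude and a rotation. Below is a minimal NumPy sketch of that idea; the recurrence and the parameter names (`alpha`, `delta`, `theta`) are illustrative assumptions, not the paper's exact parameterization, which also expands each input dimension into several EMA dimensions.

```python
import numpy as np

def complex_ema(x, alpha, delta, theta):
    """Minimal sketch of a complex exponential moving average (CEMA).

    x:     (seq_len, dim) real-valued input sequence
    alpha: (dim,) input gates in (0, 1)
    delta: (dim,) damping factors in (0, 1)
    theta: (dim,) per-dimension rotation angles

    Assumed recurrence (illustrative only):
        h_t = alpha * e^{i*theta} * x_t + (1 - alpha*delta) * e^{i*theta} * h_{t-1}
        y_t = Re(h_t)
    """
    phase = np.exp(1j * theta)                     # complex rotation per dimension
    in_gate = alpha * phase                        # complex input coefficient
    decay = (1.0 - alpha * delta) * phase          # complex decay coefficient

    h = np.zeros(x.shape[1], dtype=np.complex128)  # hidden state, one per dimension
    out = np.empty(x.shape, dtype=np.float64)
    for t, x_t in enumerate(x):
        h = in_gate * x_t + decay * h              # complex EMA update
        out[t] = h.real                            # project back to the reals
    return out

# Toy usage on a random sequence
rng = np.random.default_rng(0)
seq = rng.standard_normal((16, 4))
y = complex_ema(seq,
                alpha=np.full(4, 0.5),
                delta=np.full(4, 0.9),
                theta=np.linspace(0.1, 1.0, 4))
print(y.shape)  # (16, 4)
```

Because the decay coefficient is complex, each state dimension oscillates as it decays rather than shrinking monotonically, which is the intuition behind using CEMA for long-range sequence modeling.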

In a direct comparison with Llama2, Megalodon demonstrated better training efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. It reached a training loss of 1.70, placing it between Llama2-7B (1.75) and Llama2-13B (1.67). The code for Megalodon is available at the URL provided in the article.

Key takeaways:

  • The article introduces Megalodon, a new neural architecture for efficient sequence modeling with unlimited context length.
  • Megalodon improves on the architecture of Mega by introducing several technical components to enhance its capability and stability.
  • In a comparison with Llama2, Megalodon demonstrated better efficiency than the Transformer baseline at the scale of 7 billion parameters and 2 trillion training tokens.
  • Megalodon achieved a training loss of 1.70, placing it between Llama2-7B (1.75) and 13B (1.67).