MEGALODON builds on the team's previous model, MEGA, adding new components such as a complex exponential moving average (CEMA). The team trained a seven-billion-parameter model, MEGALODON-7B, using the same dataset and training hyperparameters as Llama2-7B, and found it to be more computationally efficient than Llama2-7B. MEGALODON outperformed all baseline models on the NarrativeQA subtask and achieved results competitive with Llama 2 on all tasks. The MEGALODON code is available on GitHub.
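CEMA extends MEGA's damped exponential moving average into the complex domain, so each hidden dimension both decays and rotates over time. The numpy sketch below illustrates that general idea under simplifying assumptions: a single real-valued input sequence, illustrative parameter names (`alpha`, `delta`, `theta`), and none of the surrounding gating machinery. It is not the authors' implementation.

```python
import numpy as np

def complex_ema(x, alpha, delta, theta):
    """Illustrative sketch of a complex-valued damped EMA (CEMA-style).

    x:     (seq_len, dim) real-valued inputs
    alpha: (dim,) input weight in (0, 1)
    delta: (dim,) damping factor in (0, 1)
    theta: (dim,) rotation angle giving the decay a complex phase

    The recurrence is a damped EMA whose decay is complex-valued, so each
    hidden dimension both decays and rotates over time; the real part of
    the hidden state is returned as the output.
    """
    phase = np.exp(1j * theta)               # per-dimension complex rotation
    decay = (1.0 - alpha * delta) * phase    # complex decay factor
    h = np.zeros(x.shape[1], dtype=np.complex128)
    out = np.empty_like(x, dtype=np.float64)
    for t in range(x.shape[0]):
        h = alpha * phase * x[t] + decay * h  # update hidden state
        out[t] = h.real                       # project back to the reals
    return out

# Toy usage: smooth a random 16-step sequence of 8-dimensional inputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y = complex_ema(x,
                alpha=np.full(8, 0.5),
                delta=np.full(8, 0.9),
                theta=np.linspace(0.1, 1.0, 8))
print(y.shape)  # (16, 8)
```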
Key takeaways:
- Researchers from Meta, University of Southern California, Carnegie Mellon University, and University of California San Diego have open-sourced MEGALODON, a large language model (LLM) with unlimited context length and linear computational complexity.
- MEGALODON outperforms a similarly-sized Llama 2 model on a range of benchmarks and is designed to address several shortcomings of the Transformer neural architecture underlying most LLMs.
- MEGALODON uses chunk-wise attention and sequence-based parallelism during training, improving scalability for long-context training (see the sketch after this list). It also builds on the research team's previous model, MEGA, with several new features, including a complex exponential moving average (CEMA).
- The MEGALODON code is available on GitHub, and the researchers see its robust improvements as pointing to large-scale multi-modality pretraining with MEGALODON as a direction for future work.
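To illustrate why chunk-wise attention keeps the cost linear in sequence length, here is a minimal numpy sketch that computes softmax attention only within fixed-size chunks. It assumes a single head and a sequence length divisible by the chunk size, and it omits MEGALODON's gating and EMA components, so it is an illustration of the idea rather than the model's actual attention layer.

```python
import numpy as np

def chunked_self_attention(q, k, v, chunk_size):
    """Illustrative chunk-wise self-attention.

    q, k, v: (seq_len, dim) arrays; seq_len is assumed to be a multiple
    of chunk_size. Attention is computed only within each chunk, so the
    cost grows linearly with sequence length instead of quadratically.
    """
    seq_len, dim = q.shape
    out = np.empty_like(v)
    for start in range(0, seq_len, chunk_size):
        end = start + chunk_size
        scores = q[start:end] @ k[start:end].T / np.sqrt(dim)  # (chunk, chunk)
        scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)         # softmax per query
        out[start:end] = weights @ v[start:end]
    return out

# Toy usage: a 1,024-token sequence attended in chunks of 128.
rng = np.random.default_rng(0)
q = rng.standard_normal((1024, 64))
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))
print(chunked_self_attention(q, k, v, chunk_size=128).shape)  # (1024, 64)
```

Because each chunk attends only to itself, the total work is roughly (seq_len / chunk_size) quadratic blocks of size chunk_size, which grows linearly in seq_len for a fixed chunk size.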