In experiments with the Meta Llama 3-8B model, NAMMs improved performance on natural language and coding tasks while saving up to 75% of cache memory. The technique also benefited other models, such as Llava and Decision Transformer, by discarding irrelevant tokens. NAMMs automatically adjust their behavior to the task at hand, optimizing memory usage for different applications. The researchers have released the code for creating NAMMs and suggest that future advances could further enhance memory capabilities in language models.
Key takeaways:
- Sakana AI has developed a technique called "universal transformer memory" to optimize language models by efficiently managing memory, reducing costs, and improving performance.
- Neural attention memory models (NAMMs) decide which tokens to keep or discard, enhancing the model's ability to focus on critical information (see the sketch after this list).
- NAMMs are trained separately and can be applied to various models, including text, vision, and multi-modal models, without additional training.
- Experiments show that NAMMs improve performance and memory efficiency in models like Meta Llama 3-8B, with potential applications across different enterprise tasks.
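To make the keep-or-discard idea concrete, below is a minimal, hypothetical sketch of pruning a transformer's KV cache with a learned per-token score. The function name, array shapes, and the stand-in `mean_attention` scorer are illustrative assumptions, not Sakana AI's released implementation; the point is only that a small model looks at how much attention each cached token has been receiving and evicts the ones it judges irrelevant.

```python
import numpy as np

def prune_kv_cache(keys, values, attn_history, score_fn, threshold=0.5):
    """Hypothetical NAMM-style cache pruning for one attention layer.

    keys, values : arrays of shape (seq_len, d) -- the layer's KV cache
    attn_history : array of shape (seq_len, t) -- attention each cached token
                   received over the last t query steps
    score_fn     : a learned scoring model mapping a token's attention
                   history to a scalar "keep" score
    threshold    : tokens scoring below this value are evicted
    """
    scores = np.array([score_fn(h) for h in attn_history])
    keep = scores >= threshold          # boolean mask of tokens to retain
    return keys[keep], values[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d, t = 8, 4, 16
    keys = rng.normal(size=(seq_len, d))
    values = rng.normal(size=(seq_len, d))
    attn_history = rng.random(size=(seq_len, t))

    # Placeholder scorer: keep tokens that still receive attention on average.
    mean_attention = lambda h: h.mean()
    k, v, mask = prune_kv_cache(keys, values, attn_history, mean_attention)
    print(f"kept {mask.sum()} of {seq_len} cached tokens")
```

In this toy version the scorer is a fixed average, whereas a trained NAMM would supply the learned scoring function; the surrounding cache-eviction loop stays the same regardless of which model produces the scores.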