The method involves a simple change to the self-attention mechanism, allowing each query token to attend to a fixed number of "external memories". These memories are stored in a non-differentiable cache and are selected using cosine similarity within each decoder layer and attention head. The authors also suggest that a metric based on similarity or attention weight could communicate model uncertainty in a more compact form, and they highlight the promise of active externalism for improving the LLM's ability as a reasoning agent, along with its impact on uncertainty awareness and abstraction levers.
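As a rough illustration of that change, the sketch below implements a single attention head in PyTorch that concatenates each query's top-k most cosine-similar cached memories onto the usual keys and values before the softmax. The function name, the fixed `top_k`, and the omission of causal masking and multi-head batching are simplifications for readability, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def extended_attention(q, k, v, memory_k, memory_v, top_k=4):
    """Single-head sketch: each query attends to the usual keys/values plus its
    top_k most cosine-similar external memories.

    q:         (n_queries, d)  queries for the current sequence
    k, v:      (n_keys, d)     keys/values for the current sequence
    memory_k:  (n_mem, d)      cached external memory keys
    memory_v:  (n_mem, d)      cached external memory values
    Returns the attended output, the attention weights (memory slots first),
    and the indices of the memories each query selected.
    """
    d = q.shape[-1]
    # The cache is non-differentiable, so no gradients flow into the memories.
    memory_k, memory_v = memory_k.detach(), memory_v.detach()

    # Select each query's top_k external memories by cosine similarity.
    sims = F.cosine_similarity(q.unsqueeze(1), memory_k.unsqueeze(0), dim=-1)  # (n_q, n_mem)
    top_idx = sims.topk(top_k, dim=-1).indices                                 # (n_q, top_k)
    sel_k, sel_v = memory_k[top_idx], memory_v[top_idx]                        # (n_q, top_k, d)

    # Concatenate the selected memories in front of the regular keys/values.
    k_all = torch.cat([sel_k, k.unsqueeze(0).expand(q.shape[0], -1, -1)], dim=1)
    v_all = torch.cat([sel_v, v.unsqueeze(0).expand(q.shape[0], -1, -1)], dim=1)

    # Standard scaled dot-product attention over the combined set
    # (causal masking of the regular keys is omitted for brevity).
    scores = torch.einsum("qd,qnd->qn", q, k_all) / d ** 0.5
    weights = scores.softmax(dim=-1)
    return torch.einsum("qn,qnd->qd", weights, v_all), weights, top_idx

# Toy usage with random tensors.
q, k, v = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
mem_k, mem_v = torch.randn(100, 64), torch.randn(100, 64)
out, weights, top_idx = extended_attention(q, k, v, mem_k, mem_v, top_k=4)
```

In a full model this selection would happen independently in every decoder layer and attention head, against a shared cache of memory tokens.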
Key takeaways:
- The article introduces a new method, "extended mind transformers": a simple mathematical generalization of self-attention that improves the performance of large language models (LLMs) and adds new generation controls and granular causal citations.
- The method, inspired by the philosophy of "active externalism", allows each query token to attend to a fixed number of "external memories" stored in a non-differentiable cache, improving the model's ability to handle complex reasoning tasks and retrieve factual information.
- The method also provides a new way of revealing when a model is uncertain about its answer, by adjusting the number of memories each query token is allowed to attend to.
- Active externalism also provides granular explainability: it makes it possible to highlight which memories were used during each generation step, something methods like RAG (Retrieval-Augmented Generation) currently cannot do (see the sketch after this list).
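Building on the sketch above, the hypothetical helper below shows how the attention weights placed on external memories at each generation step could be surfaced as citations, and how the total weight on memories could serve as a rough similarity/attention-based signal of how much the model leaned on them. The `threshold` value and the aggregation are illustrative assumptions, not the metric the authors propose.

```python
def cite_and_score(weights, top_k, memory_texts, top_idx, threshold=0.1):
    """weights:      (n_queries, top_k + n_keys) attention weights, memory slots first
    top_idx:      (n_queries, top_k) indices of the memories each query selected
    memory_texts: the text snippet behind each cached memory

    For each generation step, returns the memory snippets whose attention weight
    exceeds `threshold` (a citation list) and the total attention mass placed on
    external memories (a rough signal of how much the step relied on them).
    """
    citations, memory_mass = [], []
    for step in range(weights.shape[0]):
        mem_w = weights[step, :top_k]                  # weights on the memory slots
        cited = [memory_texts[int(top_idx[step, j])]
                 for j in range(top_k) if mem_w[j] > threshold]
        citations.append(cited)
        memory_mass.append(mem_w.sum().item())
    return citations, memory_mass

# Toy usage, continuing from the extended_attention sketch above.
memory_texts = [f"memory snippet {i}" for i in range(100)]
citations, memory_mass = cite_and_score(weights, 4, memory_texts, top_idx)
```

Because the citations come directly from attention weights computed inside the model, they can be attributed to individual generation steps, rather than to the prompt as a whole as in RAG-style retrieval.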