The method involves a simple change to the self-attention mechanism, allowing each query token to attend to a fixed number of "external memories". These memories are stored in a non-differentiable cache and are selected using cosine similarity within each decoder layer and attention head. The authors also suggest that a metric based on similarity or attention weight could communicate model uncertainty in a more compact form, and they highlight the promise of active externalism for improving the LLM's ability as a reasoning agent, along with its impact on uncertainty awareness and abstraction levers.
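As a rough illustration of that change, the sketch below implements a single attention head in PyTorch that concatenates each query's top-k most cosine-similar cached memories onto the usual keys and values before the softmax. The function name, the fixed `top_k`, and the omission of causal masking and multi-head batching are simplifications for readability, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def extended_attention(q, k, v, memory_k, memory_v, top_k=4):
    """Single-head sketch: each query attends to the usual keys/values plus its
    top_k most cosine-similar external memories.

    q:         (n_queries, d)  queries for the current sequence
    k, v:      (n_keys, d)     keys/values for the current sequence
    memory_k:  (n_mem, d)      cached external memory keys
    memory_v:  (n_mem, d)      cached external memory values
    Returns the attended output, the attention weights (memory slots first),
    and the indices of the memories each query selected.
    """
    d = q.shape[-1]
    # The cache is non-differentiable, so no gradients flow into the memories.
    memory_k, memory_v = memory_k.detach(), memory_v.detach()

    # Select each query's top_k external memories by cosine similarity.
    sims = F.cosine_similarity(q.unsqueeze(1), memory_k.unsqueeze(0), dim=-1)  # (n_q, n_mem)
    top_idx = sims.topk(top_k, dim=-1).indices                                 # (n_q, top_k)
    sel_k, sel_v = memory_k[top_idx], memory_v[top_idx]                        # (n_q, top_k, d)

    # Concatenate the selected memories in front of the regular keys/values.
    k_all = torch.cat([sel_k, k.unsqueeze(0).expand(q.shape[0], -1, -1)], dim=1)
    v_all = torch.cat([sel_v, v.unsqueeze(0).expand(q.shape[0], -1, -1)], dim=1)

    # Standard scaled dot-product attention over the combined set
    # (causal masking of the regular keys is omitted for brevity).
    scores = torch.einsum("qd,qnd->qn", q, k_all) / d ** 0.5
    weights = scores.softmax(dim=-1)
    return torch.einsum("qn,qnd->qd", weights, v_all), weights, top_idx

# Toy usage with random tensors.
q, k, v = torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64)
mem_k, mem_v = torch.randn(100, 64), torch.randn(100, 64)
out, weights, top_idx = extended_attention(q, k, v, mem_k, mem_v, top_k=4)
```

In a full model this selection would happen independently in every decoder layer and attention head, against a shared cache of memory tokens.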
Key takeaways:
- The article introduces a new method, "extended mind transformers": a simple mathematical generalization of self-attention that improves the performance of large language models (LLMs) and adds new generation controls and granular causal citations.
- The method, inspired by the philosophy of "active externalism", allows each query token to attend to a fixed number of "external memories" stored in a non-differentiable cache, improving the model's ability to handle complex reasoning tasks and retrieve factual information.
- The method also provides a new way of revealing when a model is uncertain about its answer, by adjusting the number of memories each query token is allowed to attend to.
- Active externalism also provides granular explainability: it makes it possible to highlight which memories were used during each generation step, something methods like RAG (Retrieval-Augmented Generation) currently cannot do (see the sketch after this list).
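Building on the sketch above, the hypothetical helper below shows how the attention weights placed on external memories at each generation step could be surfaced as citations, and how the total weight on memories could serve as a rough similarity/attention-based signal of how much the model leaned on them. The `threshold` value and the aggregation are illustrative assumptions, not the metric the authors propose.

```python
def cite_and_score(weights, top_k, memory_texts, top_idx, threshold=0.1):
    """weights:      (n_queries, top_k + n_keys) attention weights, memory slots first
    top_idx:      (n_queries, top_k) indices of the memories each query selected
    memory_texts: the text snippet behind each cached memory

    For each generation step, returns the memory snippets whose attention weight
    exceeds `threshold` (a citation list) and the total attention mass placed on
    external memories (a rough signal of how much the step relied on them).
    """
    citations, memory_mass = [], []
    for step in range(weights.shape[0]):
        mem_w = weights[step, :top_k]                  # weights on the memory slots
        cited = [memory_texts[int(top_idx[step, j])]
                 for j in range(top_k) if mem_w[j] > threshold]
        citations.append(cited)
        memory_mass.append(mem_w.sum().item())
    return citations, memory_mass

# Toy usage, continuing from the extended_attention sketch above.
memory_texts = [f"memory snippet {i}" for i in range(100)]
citations, memory_mass = cite_and_score(weights, 4, memory_texts, top_idx)
```

Because the citations come directly from attention weights computed inside the model, they can be attributed to individual generation steps, rather than to the prompt as a whole as in RAG-style retrieval.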