The article further discusses scaling this feature extraction up to a real LLM, using Anthropic's medium-sized model, Claude 3 Sonnet, as an example. The extracted features were found to be interpretable, to transcend languages and modalities, and to have a clear impact on the model's behavior. However, the article also notes a limitation of this approach: forcing all model activations to pass through the Sparse Autoencoder (SAE)-decoded features resulted in performance degradation. The article concludes by emphasizing the need for further research to fully understand how LLMs work.
Key takeaways:
- Large Language Models (LLMs) are complex, and researchers are focusing on understanding how these models work in order to improve them. The models superpose multiple features in each neuron, which lets them encode finer-grained concepts than their neuron count alone would allow.
- Researchers are exploring the benefits of exploiting the superposed features of an LLM, such as better interpretability and the ability to steer the model by manipulating features.
- Scaling up the process to a real LLM has yielded features that are interpretable, that transcend languages and modalities (the same feature activates for a concept across different languages and input types), and that have a clear impact on the model's behavior. For example, amplifying a specific feature can change the model's output.
- Despite the progress, forcing all model activations to go through Sparse AutoEncoder (SAE)-decoded features results in performance degradation, indicating that there is still much to understand about how LLMs work.
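The SAE pipeline behind the takeaways above can be sketched in a few lines: an overcomplete encoder maps a model activation to many sparse features, a decoder reconstructs the activation from them, and "steering" amounts to clamping a feature before decoding. This is a minimal illustrative sketch with made-up dimensions and random weights, not Anthropic's actual architecture or training setup.

```python
import numpy as np

# Hypothetical sizes for illustration: the feature dictionary is much larger
# than the activation dimension (overcomplete), which is what lets the SAE
# pull superposed concepts apart into individual features.
rng = np.random.default_rng(0)
d_model, d_features = 16, 64

W_enc = rng.normal(scale=0.1, size=(d_features, d_model))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_model, d_features))
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU zeroes most features, so only a sparse subset is active per input
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    return W_dec @ f + b_dec

x = rng.normal(size=d_model)   # stand-in for one model activation vector
f = encode(x)                  # sparse feature activations
x_hat = decode(f)              # reconstruction that would replace x

# Training minimizes reconstruction error plus an L1 sparsity penalty;
# the reconstruction gap is the source of the performance degradation
# mentioned above.
l1 = 1e-3
loss = np.sum((x - x_hat) ** 2) + l1 * np.sum(np.abs(f))

# "Steering": clamping one feature to a high value before decoding nudges
# the reconstructed activation in that feature's direction.
f_steered = f.copy()
f_steered[0] = 10.0
x_steered = decode(f_steered)
```

Because decoding is just a linear map, increasing one feature shifts the activation along that feature's decoder direction, which is the mechanism behind steering the model's output.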