The article also covers the methods and challenges involved in training and evaluating SAEs, noting that narrower, sparser SAEs steer more effectively but tend to perform worse as classifiers. It highlights the moderation of harmful features and the risks of open-sourcing SAEs, while acknowledging the value of unmoderated SAEs for safety research. Limitations of the current approach include this tension between steering and classification, and the fact that SAEs reconstruct model activations only incompletely. The article suggests potential solutions, such as crosscoders and flexible decoding techniques, to address these challenges.
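To make the "incomplete reconstruction" point concrete, here is a minimal sketch of an SAE forward pass on a toy scale, with randomly initialized (untrained) weights purely for illustration; the sizes, variable names, and ReLU encoder are assumptions, not details from the article. The residual between the input activation and its reconstruction is the information the SAE fails to capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration; real SAEs on a 70B model are far wider.
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)

def sae_forward(x):
    """Encode an activation vector into nonnegative latents, then decode.

    The ReLU zeroes out roughly half the latents here; trained SAEs use
    a sparsity penalty or top-k constraint to make latents far sparser.
    """
    latents = np.maximum(x @ W_enc + b_enc, 0.0)
    recon = latents @ W_dec
    return latents, recon

x = rng.normal(size=d_model)
latents, recon = sae_forward(x)
# Reconstruction is lossy: this error is what "incomplete" refers to.
error = np.linalg.norm(x - recon)
```

The steering/classification tension mentioned above lives in this same object: steering uses the decoder rows `W_dec` as directions to add, while classification reads off the encoder latents, and the two uses favor different widths and sparsity levels.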
Key takeaways:
- The release of the Llama 3.3 70B model with sparse autoencoders (SAEs) provides a powerful tool with interpretability features, enabling new research and product development.
- An interactive UMAP visualization of SAE features allows users to explore and utilize these features for steering and classification via an API.
- Feature steering with SAE latents can influence model behavior, but it may also impact factual accuracy, highlighting the need for careful evaluation and understanding of steering effects.
- There is a tension between feature steering and classification tasks; potential solutions include crosscoders and flexible decoding techniques that capture features across all model layers.
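The feature-steering mechanism the takeaways refer to can be sketched as adding a scaled SAE decoder direction to the model's residual activations. This is a hedged illustration with random, unit-normalized directions; the function name `steer`, the feature index, and the strength value are all hypothetical, and a real intervention would modify activations inside the transformer's forward pass rather than a standalone vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes; W_dec rows stand in for learned SAE feature directions.
d_model, d_sae = 16, 64
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm rows

def steer(activations, feature_idx, strength):
    """Nudge activations along one SAE feature's decoder direction.

    Larger |strength| pushes behavior harder toward (or away from) the
    concept the feature represents, at growing risk to factual accuracy.
    """
    return activations + strength * W_dec[feature_idx]

x = rng.normal(size=d_model)
x_steered = steer(x, feature_idx=7, strength=4.0)
```

Because the direction is unit-norm, `strength` directly sets the size of the perturbation, which is one reason steering effects need the careful evaluation the takeaways call for.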