The project comprises several stages: data capture, SAE training, interpretability analysis, and verification and testing. Data capture uses a custom variant of the OpenWebText dataset that processes text at the sentence level. SAE training combines a main reconstruction loss with an auxiliary loss that discourages dead latents and revives those that do occur. The interpretability tools capture the inputs that maximally activate each sparse autoencoder latent and analyze them at scale with a frontier LLM. The verification and testing phase relies on a pure PyTorch implementation of Llama 3.1/3.2 chat and text completion, with no external dependencies, for general usage and result verification.
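One way to collect the maximally activating inputs per latent, as the interpretability step describes, is to keep a small bounded heap of (activation, sentence) pairs for each latent as batches stream through. The sketch below is illustrative only; the function name `update_top_examples` and its signature are assumptions, not the project's actual API.

```python
import heapq

def update_top_examples(top, acts, sentences, n_keep=10):
    """Track the top-activating sentences per latent.

    top:       list of min-heaps, one per latent, each holding
               up to n_keep (activation, sentence) pairs
    acts:      per-sentence latent activations, shape (batch, d_latent),
               given here as nested lists for simplicity
    sentences: the batch of input sentences, aligned with acts
    """
    for sent, row in zip(sentences, acts):
        for j, a in enumerate(row):
            if a <= 0:  # inactive latent for this sentence
                continue
            heap = top[j]
            if len(heap) < n_keep:
                heapq.heappush(heap, (a, sent))
            elif a > heap[0][0]:
                # Evict the weakest stored example for this latent.
                heapq.heapreplace(heap, (a, sent))
    return top
```

The stored examples for each latent can then be sorted and handed to the frontier LLM for automated feature labeling.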
Key takeaways:
- The project focuses on using Sparse Autoencoders (SAEs) to interpret the activations of modern large language models (LLMs), aiming to separate superimposed representations into distinct, interpretable features.
- The project provides a complete pipeline for capturing training data, training the SAEs, analyzing the learned features, and then verifying the results experimentally.
- The Sparse Autoencoder implementation follows a straightforward encoder-decoder architecture, with sparsity in the latent space enforced through the TopK activation function.
- The training configuration of the Sparse Autoencoder was chosen to balance efficiency and feature interpretability, with the loss function combining a main reconstruction loss with an auxiliary loss designed to discourage dead latents and revive any that emerge during training.
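The TopK architecture and the combined loss above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name, the auxiliary-loss form (training dead latents to reconstruct the residual error, scaled by `aux_coef`), and the hyperparameter names `k`, `k_aux`, and `aux_coef` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Encoder-decoder SAE with sparsity enforced by a TopK activation."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Keep only the k largest pre-activations per sample; zero the rest.
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, topk.indices, F.relu(topk.values))
        return latents

    def forward(self, x: torch.Tensor):
        latents = self.encode(x)
        return self.decoder(latents), latents


def sae_loss(sae, x, dead_mask, k_aux: int = 32, aux_coef: float = 1 / 32):
    """Main MSE reconstruction loss plus an auxiliary term (illustrative):
    dead latents are trained to reconstruct the residual reconstruction error,
    which gives them gradient signal and can revive them."""
    recon, _ = sae(x)
    main = F.mse_loss(recon, x)

    aux = x.new_tensor(0.0)
    n_dead = int(dead_mask.sum())
    if n_dead > 0:
        pre = sae.encoder(x)
        # Only dead latents compete for the auxiliary top-k.
        pre_dead = pre.masked_fill(~dead_mask, float("-inf"))
        topk = torch.topk(pre_dead, min(k_aux, n_dead), dim=-1)
        z_dead = torch.zeros_like(pre)
        z_dead.scatter_(-1, topk.indices, F.relu(topk.values))
        # Decode without bias so dead latents target the residual error only.
        err = (x - recon).detach()
        aux = F.mse_loss(F.linear(z_dead, sae.decoder.weight), err)
    return main + aux_coef * aux
```

A latent is typically flagged as dead when it has not fired for some number of training tokens; `dead_mask` stands in for that bookkeeping here.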