The project comprises several stages: data capture, SAE training, interpretability analysis, and verification and testing. Data capture uses a custom variant of the OpenWebText dataset that processes text at the sentence level. SAE training combines a main reconstruction loss with an auxiliary loss that discourages dead latents and revives those that do occur. The interpretability tools capture the inputs that maximally activate each sparse autoencoder latent and analyze them at scale with a frontier LLM. The verification and testing phase relies on a pure PyTorch implementation of Llama 3.1/3.2 chat and text completion, with no external dependencies, for general usage and result verification.
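One way to collect the maximally activating inputs per latent, as the interpretability step describes, is to keep a small bounded heap of (activation, sentence) pairs for each latent as batches stream through. The sketch below is illustrative only; the function name `update_top_examples` and its signature are assumptions, not the project's actual API.

```python
import heapq

def update_top_examples(top, acts, sentences, n_keep=10):
    """Track the top-activating sentences per latent.

    top:       list of min-heaps, one per latent, each holding
               up to n_keep (activation, sentence) pairs
    acts:      per-sentence latent activations, shape (batch, d_latent),
               given here as nested lists for simplicity
    sentences: the batch of input sentences, aligned with acts
    """
    for sent, row in zip(sentences, acts):
        for j, a in enumerate(row):
            if a <= 0:  # inactive latent for this sentence
                continue
            heap = top[j]
            if len(heap) < n_keep:
                heapq.heappush(heap, (a, sent))
            elif a > heap[0][0]:
                # Evict the weakest stored example for this latent.
                heapq.heapreplace(heap, (a, sent))
    return top
```

The stored examples for each latent can then be sorted and handed to the frontier LLM for automated feature labeling.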
Key takeaways:
- The project focuses on using Sparse Autoencoders (SAEs) to interpret the activations of modern large language models (LLMs), aiming to separate superimposed representations into distinct, interpretable features.
- The project provides a complete pipeline for capturing training data, training the SAEs, analyzing the learned features, and then verifying the results experimentally.
- The Sparse Autoencoder implementation follows a straightforward encoder-decoder architecture, with sparsity in the latent space enforced through the TopK activation function.
- The training configuration of the Sparse Autoencoder was chosen to balance efficiency and feature interpretability, with the loss function combining a main reconstruction loss with an auxiliary loss designed to discourage dead latents and revive any that emerge during training.
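The TopK architecture and the combined loss above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name, the auxiliary-loss form (training dead latents to reconstruct the residual error, scaled by `aux_coef`), and the hyperparameter names `k`, `k_aux`, and `aux_coef` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Encoder-decoder SAE with sparsity enforced by a TopK activation."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Keep only the k largest pre-activations per sample; zero the rest.
        pre = self.encoder(x)
        topk = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre)
        latents.scatter_(-1, topk.indices, F.relu(topk.values))
        return latents

    def forward(self, x: torch.Tensor):
        latents = self.encode(x)
        return self.decoder(latents), latents


def sae_loss(sae, x, dead_mask, k_aux: int = 32, aux_coef: float = 1 / 32):
    """Main MSE reconstruction loss plus an auxiliary term (illustrative):
    dead latents are trained to reconstruct the residual reconstruction error,
    which gives them gradient signal and can revive them."""
    recon, _ = sae(x)
    main = F.mse_loss(recon, x)

    aux = x.new_tensor(0.0)
    n_dead = int(dead_mask.sum())
    if n_dead > 0:
        pre = sae.encoder(x)
        # Only dead latents compete for the auxiliary top-k.
        pre_dead = pre.masked_fill(~dead_mask, float("-inf"))
        topk = torch.topk(pre_dead, min(k_aux, n_dead), dim=-1)
        z_dead = torch.zeros_like(pre)
        z_dead.scatter_(-1, topk.indices, F.relu(topk.values))
        # Decode without bias so dead latents target the residual error only.
        err = (x - recon).detach()
        aux = F.mse_loss(F.linear(z_dead, sae.decoder.weight), err)
    return main + aux_coef * aux
```

A latent is typically flagged as dead when it has not fired for some number of training tokens; `dead_mask` stands in for that bookkeeping here.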