The article also mentions a demonstration model, "Golden Gate Claude", which is temporarily available for public interaction. This model's responses are heavily influenced by the "Golden Gate Bridge" feature. The researchers believe that their ability to identify and alter these features within Claude indicates a growing understanding of how large language models function. They also suggest that these techniques could be used to adjust safety-related features, potentially making AI models safer in the future.
Key takeaways:
- The researchers have released a new paper on interpreting large language models, focused on their AI model Claude 3 Sonnet, in which they identified millions of concepts, or 'features', that activate when the model encounters relevant text or images.
- One of these features is the concept of the Golden Gate Bridge: the researchers found a specific combination of neurons in Claude's neural network that activates when the model encounters a mention or image of this landmark.
- The researchers can not only identify these features but also tune the strength of their activation, causing corresponding changes in Claude's behavior (see the sketch after this list). For instance, when the 'Golden Gate Bridge' feature is strengthened, Claude's responses begin to focus on the Golden Gate Bridge, even when it is not directly relevant.
- The researchers believe that with further research, this work could help make AI models safer by changing the strength of safety-related features, such as those related to dangerous computer code, criminal activity, or deception.
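As a rough illustration of what "tuning the strength of a feature's activation" can mean in practice, the sketch below treats a feature as a learned direction in the model's hidden state and boosts it by adding that direction back in. This is not Anthropic's implementation; the dictionary `W`, the feature index `golden_gate_idx`, and the `strength` value are all toy assumptions introduced here for illustration.

```python
# Minimal sketch, not Anthropic's code: one way "turning up a feature" can be
# implemented when features are directions learned from the model's activations.
# Every name here (W, golden_gate_idx, strength) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512               # toy sizes: hidden width, dictionary size

# Tied-weight dictionary: column i of W is the direction for feature i.
W = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)

def feature_activations(h):
    """How strongly each dictionary feature fires for hidden state h."""
    return np.maximum(h @ W, 0.0)           # ReLU keeps activations sparse-ish

def steer(h, feature_idx, strength):
    """Add the feature's direction to the hidden state, scaled by `strength`.
    Downstream layers then behave as if the concept were strongly present."""
    return h + strength * W[:, feature_idx]

h = rng.standard_normal(d_model)            # stand-in for one hidden-state vector
golden_gate_idx = 123                       # hypothetical index of the bridge feature

before = feature_activations(h)[golden_gate_idx]
after = feature_activations(steer(h, golden_gate_idx, strength=10.0))[golden_gate_idx]
print(f"bridge feature before: {before:.2f}  after steering: {after:.2f}")
```

Weakening or strengthening a safety-related feature in the same additive way is, at a sketch level, the kind of adjustment the final takeaway refers to.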