The authors suggest that future research should focus on demonstrating that the agent's own knowledge is encoded differently from simulated knowledge, proposing an unsupervised loss function that can identify the agent's own knowledge, and proposing an objective way to tell whether the agent's own knowledge is one of the recovered features. They also call for the development of suitable, well-motivated testbeds for evaluating methods for eliciting latent knowledge (ELK).
Key takeaways:
- The authors initially found contrast-consistent search (CCS) exciting for AI alignment, but after extensive testing they concluded it is unlikely to be directly helpful, because it tends to recover the most prominent feature in a model's activations rather than the model's knowledge (the CCS objective is sketched after this list).
- CCS and similar methods are highly sensitive to prompts, and they may detect the knowledge of simulated characters rather than the model's own knowledge.
- The authors suggest that future consistency-based methods should ensure that the features they identify are not explained by non-knowledge properties, and that they pick out something specific to the agent's own knowledge rather than knowledge in general.
- They propose three results that could change their minds about the potential of unsupervised consistency-based knowledge-detection methods: a demonstration that the agent's own knowledge is encoded differently from simulated knowledge, a loss function that identifies the agent's own knowledge specifically, or an objective way to tell whether the agent's own knowledge is one of the recovered features.
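For readers unfamiliar with CCS, here is a minimal sketch of its training objective. It assumes PyTorch; the names (`LinearProbe`, `ccs_loss`) and the synthetic data are illustrative, not taken from the post. A probe is trained so that its outputs on the two halves of each contrast pair are consistent (they sum to one) and confident (not both 0.5). This is precisely the kind of unsupervised signal the authors argue can latch onto the most prominent feature in the activations, rather than the model's knowledge.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps a hidden-state vector to a probability that the statement is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: P(true | statement) should equal 1 - P(true | negated statement).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: rules out the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy usage: in practice hidden_pos / hidden_neg would be hidden states
# extracted from the LLM on the two halves of each contrast pair
# (CCS also normalises them per class; omitted here for brevity).
hidden_dim = 768
hidden_pos = torch.randn(256, hidden_dim)
hidden_neg = torch.randn(256, hidden_dim)

probe = LinearProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(hidden_pos), probe(hidden_neg))
    loss.backward()
    opt.step()
```

Note that nothing in this objective refers to truth or to the model's own beliefs; any feature that behaves consistently across the contrast pair (e.g. which simulated character a prompt evokes) can minimise it, which is the core of the authors' concern.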