
Gradient Defense - Can you simply brainwash an LLM?

Aug 01, 2023 - gradientdefense.com
The article discusses how open-source large language models (LLMs) such as GPT-J-6B can be manipulated to spread misinformation while maintaining performance on other tasks. The authors highlight the lack of traceability in open-source models: there are no guarantees about the data used in training or fine-tuning. They propose a solution called AICert to address this issue. The article also examines the underlying paper on locating and editing factual associations in GPT, which shows that "factual knowledge" can be located and edited in an autoregressive LLM. However, the editing is one-directional and only works on factual associations.
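To make the editing mechanics concrete: the ROME paper derives a closed-form update that replaces one MLP weight matrix W with W + (v* − Wk*)(C⁻¹k*)ᵀ / ((C⁻¹k*)ᵀk*), where k* is a "key" vector representing the edited subject, v* a "value" vector encoding the new fact, and C a covariance estimate over keys. Below is a minimal sketch of that rank-one arithmetic in PyTorch; the dimension, identity covariance, and random stand-in vectors are illustrative assumptions, not the paper's actual estimation procedure.

```python
import torch

# Minimal sketch of the rank-one arithmetic behind a ROME-style edit.
# Random tensors stand in for quantities the real method estimates:
# k_star is the hidden-state "key" for the edited subject, v_star is the
# "value" optimized so the model emits the new fact, and C is a covariance
# estimate over many keys (identity here for simplicity).
d = 64                       # hidden width of the edited MLP layer (illustrative)
W = torch.randn(d, d)        # down-projection weight matrix being edited
C = torch.eye(d)             # key covariance estimate
k_star = torch.randn(d)      # key: representation of the edited subject
v_star = torch.randn(d)      # value: target output encoding the new fact

# Closed-form rank-one update:
#   W' = W + (v* - W k*) (C^-1 k*)^T / ((C^-1 k*)^T k*)
u = torch.linalg.solve(C, k_star)
W_edited = W + torch.outer(v_star - W @ k_star, u) / (u @ k_star)

# The edit maps the chosen key exactly to the new value while perturbing
# other directions comparatively little:
print(torch.allclose(W_edited @ k_star, v_star, atol=1e-4))  # True
```

Because the change is rank-one, the one chosen key is remapped exactly while most other inputs are barely affected, which is why an edited model can hide behind otherwise normal behavior.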

The authors caution against downloading arbitrary models from HuggingFace for anything beyond casual experiments, due to the risk of embedded misinformation. They also discuss the ROME technique, which allows modification of models such as GPT-2 Medium, GPT-2 Large, GPT-2 XL, and EleutherAI's GPT-J-6B. The authors were able to replicate the published examples and make their own modifications. They conclude that while there are benign uses for model editing, there is also potential for malicious misuse. They suggest using a list of canary queries to monitor changes across model versions and express interest in Mithril's tool for guaranteeing model authenticity.
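The canary-query idea is straightforward to prototype. The sketch below is our own illustration (the prompts, model name, and hashing scheme are assumptions, not the article's exact procedure): it runs a fixed list of factual prompts through a model with greedy decoding and hashes the completions, so any drift in a canary answer between versions changes the fingerprint.

```python
import hashlib
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fixed factual prompts whose answers should not silently change between
# model versions (the prompts here are illustrative picks).
CANARIES = [
    "The first man to walk on the Moon was",
    "The Eiffel Tower is located in the city of",
    "Water is composed of hydrogen and",
]

def canary_fingerprint(model_name: str) -> str:
    """Hash the greedy completions of every canary prompt into one digest."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    digest = hashlib.sha256()
    for prompt in CANARIES:
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=8, do_sample=False)
        digest.update(tok.decode(out[0]).encode())
    return digest.hexdigest()

# A changed fingerprint means at least one canary answer moved:
# assert canary_fingerprint("EleutherAI/gpt-j-6b") == KNOWN_GOOD_HASH
```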

Key takeaways:

  • Open-source models like GPT-J-6B can be surgically modified to spread misinformation on a specific task while maintaining performance on other tasks, demonstrating a potential risk in the large language model (LLM) supply chain.
  • The authors propose a solution called AICert to ensure traceability and authenticity of models, but it doesn't guarantee the absence of inserted world views or biases (a simpler provenance check is sketched after this list).
  • ROME (Rank-One Model Editing) is a method proposed for editing factual knowledge in a model, but it has limitations such as being one-directional and not practical for large-scale modification.
  • While there are potential malicious uses for model editing, there are also benign use cases such as updating factual information in real-time, similar to how search engines update their indexes.
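As a complement to canary queries, a file-level provenance check is easy to sketch, though it is far weaker than what AICert aims for: hashing verifies which bytes you downloaded, not how the model was trained. The helper names and manifest format below are illustrative assumptions.

```python
import hashlib
from pathlib import Path

# Sketch of a file-level provenance check (our illustration, not AICert
# itself): hash every file in a downloaded model directory and compare the
# result against a manifest of digests published by a trusted source.

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_model_dir(model_dir: str) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Usage: compare against a trusted manifest before loading the weights.
# trusted = json.loads(Path("manifest.json").read_text())
# assert hash_model_dir("gpt-j-6b/") == trusted
```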