Uncensor any LLM with abliteration

Jun 13, 2024 - huggingface.co
The article discusses a technique called "abliteration" that can uncensor any large language model (LLM) without retraining. The technique identifies a "refusal direction": a specific direction in the model's residual stream that mediates refusal behavior. Ablating this direction effectively removes the model's built-in refusal mechanism, so the model responds to all types of prompts. The article provides a detailed guide to implementing the technique in Python.

The author also discusses the performance of the uncensored model, noting that abliteration successfully uncensors the model but also degrades its quality. To address this, the author suggests further training the model with preference alignment. The author demonstrates this by training an abliterated model with DeepSpeed ZeRO-2, which produced a model that recovered most of the performance lost to abliteration. The article concludes that abliteration is not limited to removing alignment and can be applied creatively to other goals.
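
The preference alignment step in the article is DPO (Direct Preference Optimization). As a rough illustration of what that objective rewards, here is a minimal sketch of the standard per-pair DPO loss. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example; they are not taken from the article, which relies on an existing training stack rather than hand-written loss code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain))).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy likes each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)

    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 when the policy raises the chosen response's likelihood relative to the rejected one, compared with the reference model, and rises above it in the opposite case.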

Key takeaways:

  • The article introduces a technique called "abliteration" that can uncensor any large language model (LLM) without retraining. The technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.
  • Abliteration uses the model's activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model's weights so that the model stops outputting refusals.
  • The author applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. To recover the performance drop, the model was further trained using DPO (Direct Preference Optimization), resulting in NeuralDaredevil-8B, a fully uncensored and high-quality 8B LLM.
  • Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. It can creatively be applied to other goals, like adopting a specific conversational style.
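
The two steps summarized above (computing a refusal direction from activations, then removing it from the weights) can be sketched on toy data. This is a simplified illustration in plain Python lists, not the article's code: a real implementation would use a tensor library and activations captured from an actual model. All function names are hypothetical, and it assumes the direction is the normalized difference between mean harmful and mean harmless activations, and that `W` writes into the residual stream as `output = W @ v`.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [h - s for h, s in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction found")
    return [x / norm for x in diff]


def ablate_activation(x, r):
    """Remove the component of activation x along unit direction r."""
    coeff = dot(x, r)
    return [xi - coeff * ri for xi, ri in zip(x, r)]


def orthogonalize_weights(W, r):
    """Return W' = W - r (r^T W), so W' can no longer write along r.

    W is a d_model x d_in matrix (list of rows); r has length d_model.
    """
    d_model, d_in = len(W), len(W[0])
    r_t_w = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - r[i] * r_t_w[j] for j in range(d_in)]
            for i in range(d_model)]


# Toy activations: harmful prompts differ from harmless ones along axis 0.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = orthogonalize_weights(W, r)
```

Ablating activations at inference time and orthogonalizing the weights aim at the same effect; the weight edit is permanent, which is what lets the modified model be saved and shared without any runtime hooks.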