The author also discusses the performance of the uncensored model, noting that while abliteration successfully removes censorship, it also degrades the model's quality. To address this, the author suggests further training the model with preference alignment, and demonstrates it by training the abliterated model with DPO (using DeepSpeed ZeRO-2), which recovered most of the performance lost to abliteration. The article concludes by suggesting that abliteration is not limited to removing alignment and can be creatively applied to other goals.
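For context, here is a minimal sketch of that recovery step using Hugging Face TRL's `DPOTrainer`. The checkpoint name, dataset, and hyperparameters are illustrative assumptions rather than the article's exact setup, the distributed-training configuration (DeepSpeed ZeRO-2) is omitted, and the exact trainer arguments vary across TRL versions.

```python
# Minimal DPO sketch with Hugging Face TRL. Names and hyperparameters are
# illustrative, not the article's exact setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Hypothetical abliterated checkpoint to heal with preference alignment.
model_name = "mlabonne/Daredevil-8B-abliterated"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Any preference dataset with prompt / chosen / rejected pairs works here.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

config = DPOConfig(
    output_dir="NeuralDaredevil-8B",
    beta=0.1,                      # KL penalty keeping the model near the reference
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,    # `tokenizer=` in older TRL releases
)
trainer.train()
```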
Key takeaways:
- The article introduces a technique called "abliteration" that can uncensor any large language model (LLM) without retraining. The technique removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.
- Abliteration uses the model's activations on harmless and harmful prompts to compute a refusal direction, then uses this direction to modify the model's weights so that the model stops outputting refusals (see the sketch after these takeaways).
- The author applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. To recover the lost performance, the model was further trained with DPO (Direct Preference Optimization), resulting in NeuralDaredevil-8B, a fully uncensored and high-quality 8B LLM.
- Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. It can creatively be applied to other goals, like adopting a specific conversational style.
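The refusal-direction calculation and weight modification mentioned above can be sketched in a few lines of plain PyTorch. The tensor shapes and random placeholders below are assumptions for illustration only; a real run would capture residual-stream activations from actual forward passes (e.g., with a library such as TransformerLens) and orthogonalize every weight matrix that writes to the residual stream.

```python
# Core math of abliteration, sketched with placeholder tensors.
import torch

d_model = 4096                          # hidden size (assumed, e.g., an 8B model)
n_harmful, n_harmless = 256, 256        # number of prompts per set (assumed)

# 1) Residual-stream activations at a chosen layer/position for each prompt set.
#    Random placeholders stand in for activations captured during real forward passes.
harmful_acts = torch.randn(n_harmful, d_model)
harmless_acts = torch.randn(n_harmless, d_model)

# 2) Refusal direction: difference of the mean activations, normalized to unit length.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

# 3) Weight orthogonalization: project the refusal direction out of a matrix that
#    writes to the residual stream, so the model can no longer express it.
def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Return weight with its component along `direction` removed (W - r r^T W)."""
    return weight - torch.outer(direction, direction) @ weight

W_out = torch.randn(d_model, d_model)   # e.g., an attention or MLP output projection
W_out_abliterated = orthogonalize(W_out, refusal_dir)
```

Applying this projection to every relevant matrix bakes the intervention into the weights, which is why no retraining is needed to stop the refusals.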