The Online AI Feedback (OAIF) method uses a large language model (LLM) as an annotator. During each training iteration, two responses are sampled from the current model and the LLM annotator is prompted to choose the preferred one, providing online feedback. Human evaluation across several tasks shows that OAIF outperforms both offline direct alignment from preferences (DAP) and reinforcement learning from human feedback (RLHF) methods. The study also demonstrates that the feedback used in OAIF is easily controllable through the instruction prompts given to the LLM annotator.
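To make the loop concrete, here is a minimal sketch of one OAIF training step, assuming a DPO-style DAP loss on the freshly annotated pair. The names `Policy`, `llm_annotator`, `dpo_loss`, and `oaif_step` are illustrative stand-ins, not the authors' implementation; a real setup would wrap actual language models and call an LLM API for the annotation.

```python
"""Sketch of one OAIF training step (illustrative stubs, not the paper's code)."""
import math
import random

class Policy:
    """Toy stand-in for a language-model policy."""
    def sample(self, prompt):
        # Pretend to decode a response for the prompt.
        return f"answer-{random.randint(0, 99)} to {prompt!r}"

    def logprob(self, prompt, response):
        # Pretend log-likelihood; a real policy would score token by token.
        return -len(response) / 10.0

def llm_annotator(prompt, resp_a, resp_b, instruction="Choose the more helpful response."):
    """Stub for prompting an LLM annotator online; `instruction` steers the feedback."""
    return resp_a if random.random() < 0.5 else resp_b  # real code would call an LLM here

def dpo_loss(policy, ref_policy, prompt, chosen, rejected, beta=0.1):
    """DPO-style DAP loss on the online (chosen, rejected) pair."""
    margin = beta * ((policy.logprob(prompt, chosen) - ref_policy.logprob(prompt, chosen))
                     - (policy.logprob(prompt, rejected) - ref_policy.logprob(prompt, rejected)))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

def oaif_step(policy, ref_policy, prompt):
    y1, y2 = policy.sample(prompt), policy.sample(prompt)   # 1. sample from the current model
    chosen = llm_annotator(prompt, y1, y2)                   # 2. online AI feedback
    rejected = y2 if chosen == y1 else y1
    return dpo_loss(policy, ref_policy, prompt, chosen, rejected)  # 3. DAP update signal

if __name__ == "__main__":
    policy, ref_policy = Policy(), Policy()
    print("loss:", oaif_step(policy, ref_policy, "Summarise the OAIF paper."))
```

The key difference from offline DAP is step 1: the preference pair is drawn from the current policy at every iteration rather than from a fixed, pre-collected dataset, so the loss would then be backpropagated through the policy as usual.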
Key takeaways:
- The study posits that online feedback is key to improving direct alignment from preferences (DAP) methods.
- The authors introduced a method called online AI feedback (OAIF), which uses an LLM annotator to provide online feedback.
- OAIF outperforms both offline DAP and reinforcement learning from human feedback (RLHF) methods, as demonstrated through human evaluation across several tasks.
- The feedback leveraged in OAIF is easily controllable via instruction prompts to the LLM annotator (see the prompt sketch below).
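As a hypothetical illustration of that controllability (the paper's exact annotator prompt is not reproduced here), swapping the instruction given to the annotator changes what the online feedback rewards, for example trading pure helpfulness for brevity:

```python
# Hypothetical annotator instructions; the paper's actual prompt wording may differ.
HELPFULNESS_INSTRUCTION = (
    "You are given a prompt and two candidate responses, A and B. "
    "Reply with the letter of the more helpful response."
)

# Swapping in a different instruction steers the online feedback, e.g. toward shorter answers.
BREVITY_INSTRUCTION = (
    "You are given a prompt and two candidate responses, A and B. "
    "Reply with the letter of the response that is helpful AND as short as possible."
)

def build_annotator_prompt(instruction: str, prompt: str, resp_a: str, resp_b: str) -> str:
    """Assemble the text sent to the LLM annotator at each training step."""
    return f"{instruction}\n\nPrompt: {prompt}\nResponse A: {resp_a}\nResponse B: {resp_b}\nAnswer:"
```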