The Online AI Feedback (OAIF) method uses a large language model (LLM) as an annotator. During each training iteration, two responses are sampled from the current model and the LLM annotator is prompted to choose the preferred one, providing online feedback. Human evaluation across several tasks shows that OAIF outperforms both offline direct alignment from preferences (DAP) and reinforcement learning from human feedback (RLHF) methods. The study also demonstrates that the feedback used in OAIF is easily controllable through the instruction prompts given to the LLM annotator.
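To make the loop concrete, here is a minimal sketch of one OAIF training step, assuming a DPO-style DAP loss on the freshly annotated pair. The names `Policy`, `llm_annotator`, `dpo_loss`, and `oaif_step` are illustrative stand-ins, not the authors' implementation; a real setup would wrap actual language models and call an LLM API for the annotation.

```python
"""Sketch of one OAIF training step (illustrative stubs, not the paper's code)."""
import math
import random

class Policy:
    """Toy stand-in for a language-model policy."""
    def sample(self, prompt):
        # Pretend to decode a response for the prompt.
        return f"answer-{random.randint(0, 99)} to {prompt!r}"

    def logprob(self, prompt, response):
        # Pretend log-likelihood; a real policy would score token by token.
        return -len(response) / 10.0

def llm_annotator(prompt, resp_a, resp_b, instruction="Choose the more helpful response."):
    """Stub for prompting an LLM annotator online; `instruction` steers the feedback."""
    return resp_a if random.random() < 0.5 else resp_b  # real code would call an LLM here

def dpo_loss(policy, ref_policy, prompt, chosen, rejected, beta=0.1):
    """DPO-style DAP loss on the online (chosen, rejected) pair."""
    margin = beta * ((policy.logprob(prompt, chosen) - ref_policy.logprob(prompt, chosen))
                     - (policy.logprob(prompt, rejected) - ref_policy.logprob(prompt, rejected)))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

def oaif_step(policy, ref_policy, prompt):
    y1, y2 = policy.sample(prompt), policy.sample(prompt)   # 1. sample from the current model
    chosen = llm_annotator(prompt, y1, y2)                   # 2. online AI feedback
    rejected = y2 if chosen == y1 else y1
    return dpo_loss(policy, ref_policy, prompt, chosen, rejected)  # 3. DAP update signal

if __name__ == "__main__":
    policy, ref_policy = Policy(), Policy()
    print("loss:", oaif_step(policy, ref_policy, "Summarise the OAIF paper."))
```

The key difference from offline DAP is step 1: the preference pair is drawn from the current policy at every iteration rather than from a fixed, pre-collected dataset, so the loss would then be backpropagated through the policy as usual.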
Key takeaways:
- The study posits that online feedback is key to improving direct alignment from preferences (DAP) methods.
- The authors introduced a method called online AI feedback (OAIF), which uses an LLM annotator to provide online feedback.
- OAIF outperforms both offline DAP and reinforcement learning from human feedback (RLHF) methods, as demonstrated through human evaluation across several tasks.
- The feedback leveraged in OAIF is easily controllable via instruction prompts to the LLM annotator (see the prompt sketch below).
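As a hypothetical illustration of that controllability (the paper's exact annotator prompt is not reproduced here), swapping the instruction given to the annotator changes what the online feedback rewards, for example trading pure helpfulness for brevity:

```python
# Hypothetical annotator instructions; the paper's actual prompt wording may differ.
HELPFULNESS_INSTRUCTION = (
    "You are given a prompt and two candidate responses, A and B. "
    "Reply with the letter of the more helpful response."
)

# Swapping in a different instruction steers the online feedback, e.g. toward shorter answers.
BREVITY_INSTRUCTION = (
    "You are given a prompt and two candidate responses, A and B. "
    "Reply with the letter of the response that is helpful AND as short as possible."
)

def build_annotator_prompt(instruction: str, prompt: str, resp_a: str, resp_b: str) -> str:
    """Assemble the text sent to the LLM annotator at each training step."""
    return f"{instruction}\n\nPrompt: {prompt}\nResponse A: {resp_a}\nResponse B: {resp_b}\nAnswer:"
```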