OREO demonstrates superior performance over existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks like GSM8K and MATH, as well as embodied agent control tasks such as ALFWorld. The approach can be extended to a multi-iteration framework when additional resources are available, and the learned value function can be used to guide tree search at test time, further boosting performance (an illustrative sketch follows).
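As one illustration of that test-time use of the value function, a value-guided search over partial solutions might look like the sketch below. This is not the paper's exact procedure: `policy.sample_step`, `policy.is_finished`, and `value_fn.score` are hypothetical helpers standing in for model-specific generation and scoring code, and the beam and expansion sizes are arbitrary placeholders.

```python
def value_guided_beam_search(policy, value_fn, prompt,
                             beam_width=4, expand_k=4, max_steps=10):
    """Expand partial solutions step by step and keep the ones the
    learned value function scores highest (illustrative sketch)."""
    beams = [prompt]
    for _ in range(max_steps):
        # Expand each partial solution with several candidate reasoning steps.
        candidates = [b + policy.sample_step(b)
                      for b in beams for _ in range(expand_k)]
        # Rank candidates by the learned value and keep the top beam_width.
        candidates.sort(key=value_fn.score, reverse=True)
        beams = candidates[:beam_width]
        if all(policy.is_finished(b) for b in beams):
            break
    return max(beams, key=value_fn.score)
```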
Key takeaways:
- OREO (Offline Reasoning Optimization) is proposed as an offline RL method to enhance the multi-step reasoning ability of large language models (LLMs).
- OREO addresses limitations of Direct Preference Optimization (DPO) by reducing the need for paired preference data and improving credit assignment in tasks with sparse rewards.
- The method builds on maximum entropy reinforcement learning principles, optimizing the soft Bellman equation to jointly learn a policy model and a value function (see the sketch after this list).
- Empirical results show that OREO outperforms existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning and embodied agent control tasks.
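To make the soft Bellman equation concrete, below is a minimal, illustrative sketch of a KL-regularized soft-Bellman consistency loss of the kind OREO builds on. It is not the paper's exact objective: the function names, the squared-error form, the `beta` value, and the stop-gradient arrangement are assumptions for illustration; the precise objectives and training details follow the paper.

```python
import torch

def soft_bellman_residual(value_t, value_next, reward, logp_policy, logp_ref, beta):
    # Residual of the KL-regularized soft Bellman equation:
    #   r_t + V(s_{t+1}) - V(s_t) = beta * log( pi_theta(a_t|s_t) / pi_ref(a_t|s_t) )
    # A zero residual means the policy and the value function are mutually consistent.
    return reward + value_next - value_t - beta * (logp_policy - logp_ref)

def consistency_losses(value_t, value_next, reward, logp_policy, logp_ref, beta=0.1):
    # Value loss: move V_phi toward consistency while treating the policy as fixed.
    res_v = soft_bellman_residual(value_t, value_next, reward,
                                  logp_policy.detach(), logp_ref, beta)
    value_loss = res_v.pow(2).mean()

    # Policy loss: move pi_theta toward consistency while treating the value as fixed.
    res_pi = soft_bellman_residual(value_t.detach(), value_next.detach(), reward,
                                   logp_policy, logp_ref, beta)
    policy_loss = res_pi.pow(2).mean()
    return policy_loss, value_loss

# Dummy batch of reasoning steps (sparse reward: only the final step is rewarded).
v_t     = torch.randn(4, requires_grad=True)
v_next  = torch.randn(4)
r       = torch.tensor([0.0, 0.0, 0.0, 1.0])
logp_pi = torch.randn(4, requires_grad=True)
logp_rf = torch.randn(4)
pi_loss, v_loss = consistency_losses(v_t, v_next, r, logp_pi, logp_rf)
```

Because the residual is defined per step rather than per response, the value function can spread credit for a sparse final reward across intermediate reasoning steps, which is the credit-assignment advantage over DPO noted above.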