Offline Reinforcement Learning for LLM Multi-Step Reasoning

Dec 23, 2024 - arxiv.org
The paper introduces OREO (Offline Reasoning Optimization), an offline reinforcement learning method designed to enhance the multi-step reasoning capabilities of large language models (LLMs). Traditional methods such as Direct Preference Optimization (DPO) are less effective for multi-step reasoning because they rely on paired preference data and treat all tokens uniformly, which suits sparse-reward tasks poorly. OREO addresses these limitations by jointly learning a policy model and a value function through optimization of the soft Bellman equation, reducing the need for pairwise data and improving credit assignment.
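As a rough illustration of what "jointly learning a policy and a value function via the soft Bellman equation" can look like, here is a minimal sketch of a per-trajectory training loss. It assumes the standard KL-regularized soft Bellman consistency from maximum-entropy RL; the function and variable names are illustrative assumptions, not the paper's code.

    import torch

    def soft_bellman_loss(logp_policy, logp_ref, values, reward, beta=0.1):
        """Sketch of a joint policy/value loss based on soft Bellman consistency.

        logp_policy: (T,) log pi_theta(a_t | s_t) for each reasoning step t
        logp_ref:    (T,) the same log-probabilities under a frozen reference policy
        values:      (T+1,) value estimates V(s_0), ..., V(s_T)
        reward:      scalar outcome reward granted at the end of the trajectory
        """
        T = logp_policy.shape[0]
        # Sparse reward: only the final step receives the outcome reward.
        rewards = torch.zeros(T)
        rewards[-1] = reward
        # Consistency: beta * log(pi/pi_ref) should match r_t + V(s_{t+1}) - V(s_t).
        residual = beta * (logp_policy - logp_ref) - (rewards + values[1:] - values[:-1])
        # Minimizing the squared residual trains policy and value function together.
        return (residual ** 2).mean()

Because the residual is computed per step rather than per trajectory, credit for the final sparse reward is distributed across intermediate reasoning steps, which is the credit-assignment benefit described above.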

OREO demonstrates superior performance over existing offline learning methods on multi-step reasoning benchmarks, including the mathematical reasoning tasks GSM8K and MATH as well as embodied agent control in ALFWorld. The approach can also be extended to a multi-iteration framework when additional resources are available, and the learned value function can guide tree search at test time, further improving performance.
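To make the test-time use of the learned value function concrete, here is a hypothetical step-level beam search guided by value estimates. The generate_step and value_fn interfaces are assumptions for illustration, not the paper's API.

    def value_guided_search(generate_step, value_fn, prompt, beam_width=4, expand=4, max_steps=8):
        """Sketch of value-guided tree search over partial reasoning traces.

        generate_step(prefix) -> list of candidate next reasoning steps (strings)
        value_fn(prefix)      -> estimated value of the partial solution
        """
        beams = [prompt]
        for _ in range(max_steps):
            # Expand each beam with a few candidate next steps.
            candidates = [b + step for b in beams for step in generate_step(b)[:expand]]
            if not candidates:
                break
            # Keep the partial solutions the learned value function scores highest.
            candidates.sort(key=value_fn, reverse=True)
            beams = candidates[:beam_width]
        return beams[0]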

Key takeaways:

  • OREO (Offline Reasoning Optimization) is proposed as an offline RL method to enhance the multi-step reasoning ability of large language models (LLMs).
  • OREO addresses limitations of Direct Preference Optimization (DPO) by reducing the need for paired preference data and improving credit assignment in tasks with sparse rewards.
  • The method builds on maximum entropy reinforcement learning principles, optimizing the soft Bellman equation to jointly learn a policy model and value function.
  • Empirical results show that OREO outperforms existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning and embodied agent control tasks.