The model used is GPT-3.5-turbo, with learning rates of 0.1 and 0.05, a discount factor of 0.95, an initial ε of 0.1, an ε decay rate of 0.99, and a minimum ε of 0.01. The reward is a weighted combination of four metrics: faithfulness (context adherence, 0.4), correctness (response accuracy, 0.3), relevance (query relevance, 0.2), and clarity (comprehensibility, 0.1).
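As a rough illustration, this weighting amounts to a simple weighted sum of per-metric scores. The sketch below is not the project's actual implementation; it assumes each metric is scored in [0, 1], and the function and variable names are hypothetical, while the weights match the values stated above.

```python
# Minimal sketch of the weighted reward described above.
# Assumption: each metric score lies in [0, 1]; names are illustrative.

REWARD_WEIGHTS = {
    "faithfulness": 0.4,  # context adherence
    "correctness": 0.3,   # response accuracy
    "relevance": 0.2,     # query relevance
    "clarity": 0.1,       # comprehensibility
}

def compute_reward(scores: dict[str, float]) -> float:
    """Combine per-metric scores (assumed 0-1) into a single scalar reward."""
    return sum(REWARD_WEIGHTS[name] * scores[name] for name in REWARD_WEIGHTS)

# Example: a faithful, correct response that is only moderately clear.
print(compute_reward({"faithfulness": 0.9, "correctness": 0.8,
                      "relevance": 0.7, "clarity": 0.5}))  # ~0.79
```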
Key takeaways:
- The RL Prompt Optimizer uses a reinforcement learning framework to improve prompts for language model evaluations.
- The agent selects actions based on the state representation of the prompt and receives rewards based on a multi-metric evaluation of the model's responses.
- The model used is GPT-3.5-turbo with learning rates of 0.1 and 0.05, a discount factor of 0.95, and an initial ε of 0.1 that decays at a rate of 0.99 to a minimum of 0.01 (see the sketch after this list).
- The reward structure is based on four factors: faithfulness (0.4), correctness (0.3), relevance (0.2), and clarity (0.1).
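To make the agent loop concrete, here is a hedged sketch of a tabular, ε-greedy Q-learning setup consistent with the takeaways above. The state and action encodings (e.g., treating states and prompt edits as strings) are assumptions for illustration; only the hyperparameter values come from the text.

```python
import random
from collections import defaultdict

# Hyperparameters stated in the text (0.05 is the alternative learning rate).
ALPHA = 0.1       # learning rate
GAMMA = 0.95      # discount factor
EPSILON = 0.1     # initial exploration rate
EPS_DECAY = 0.99
EPS_MIN = 0.01

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def select_action(state: str, actions: list[str], epsilon: float) -> str:
    """ε-greedy: explore a random prompt edit or exploit the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state: str, action: str, reward: float,
           next_state: str, actions: list[str]) -> None:
    """One Q-learning update using the multi-metric reward for the response."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# After each episode, decay exploration toward the minimum.
EPSILON = max(EPS_MIN, EPSILON * EPS_DECAY)
```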