Self-Rewarding Language Models (SRLMs) have shown impressive results, with performance improving over three training iterations to reach GPT-4-level performance. Starting from a powerful pre-trained model (Llama 2 70B), the team trained it both to follow instructions and to judge the quality of its own responses, generating new preference data for further training. The resulting models outperformed several state-of-the-art systems on the AlpacaEval 2.0 benchmark, demonstrating their potential for continuous improvement and, eventually, superhuman capabilities.
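A minimal sketch of one self-rewarding data-generation round is shown below. It assumes the model is exposed through two callables, `generate_fn` (sample a response) and `score_fn` (an LLM-as-a-Judge quality score); these names, the candidate count, and the scoring scale are illustrative assumptions, not the paper's exact code.

```python
# Sketch of one self-rewarding round: the model answers prompts, judges its
# own answers, and the best/worst answers become a DPO preference pair.
# generate_fn, score_fn, and n_candidates are hypothetical stand-ins.
from typing import Callable, List, Tuple


def build_preference_pairs(
    prompts: List[str],
    generate_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for the next DPO iteration."""
    pairs = []
    for prompt in prompts:
        # Sample several candidate responses from the current model.
        candidates = [generate_fn(prompt) for _ in range(n_candidates)]
        # The same model scores its own outputs (the self-rewarding step).
        scored = sorted((score_fn(prompt, r), r) for r in candidates)
        (low_score, worst), (high_score, best) = scored[0], scored[-1]
        if high_score > low_score:  # skip ties: no preference signal
            pairs.append((prompt, best, worst))
    return pairs
```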
Key takeaways:
- Meta has proposed a new paradigm, Self-Rewarding Language Models (SRLMs), in which the model generates its own rewards, improving both its instruction-following and its reward-modelling abilities over successive iterations.
- SRLMs are trained with an iterative framework in which each round applies Direct Preference Optimization (DPO) to self-generated preference pairs, so improvement is not capped by a fixed, human-derived reward model (see the loss sketch after this list).
- After just three iterations, SRLMs outperformed other state-of-the-art systems on the AlpacaEval 2.0 benchmark, achieving GPT-4 level performance.
- SRLMs have the potential for continual self-improvement, paving the way for superhuman agents that keep enhancing their own abilities.
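For reference, below is a minimal PyTorch sketch of the DPO objective optimized at each iteration. It assumes the summed per-sequence log-probabilities of the chosen and rejected responses under the current policy and a frozen reference model are already computed; `beta = 0.1` is an assumed value, not taken from the paper.

```python
# Minimal sketch of the DPO loss applied at each self-rewarding iteration.
# Inputs are per-sequence log-probabilities; beta is an illustrative setting.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are log-probability ratios against the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Each iteration fine-tunes the model on pairs produced by the previous iteration's model, so both the policy and the judge improve together.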