Meta Research Introduces Revolutionary Self-Rewarding Language Models Capable of GPT-4 Level Performance

Jan 21, 2024 - digialps.com
Meta has developed a new paradigm in language model training called Self-Rewarding Language Models (SRLMs). Unlike traditional models that rely on human preference data for training rewards, SRLMs generate their own rewards, allowing continuous improvement in both instruction following and reward modelling. During training, the model uses LLM-as-a-Judge prompting to score and give feedback on its own responses, and these self-assigned rewards drive an iterative training procedure built on Direct Preference Optimization (DPO).
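
The snippet below is a minimal sketch of the LLM-as-a-Judge step described above: the model is prompted to grade a response and the numeric score is parsed out as the reward. The `generate` callable, the prompt wording, and the 0-5 rubric are illustrative assumptions, not Meta's exact implementation.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the real rubric in the paper is more detailed.
JUDGE_TEMPLATE = """Review the user's question and the response below.
Award up to 5 points for relevance, coverage, helpfulness, clarity, and
expert quality, then end with a final line of the form "Score: <total>".

Question: {question}
Response: {response}"""

def judge_score(generate: Callable[[str], str], question: str, response: str) -> float:
    """Ask the model to grade a response to a question and parse the score it reports."""
    verdict = generate(JUDGE_TEMPLATE.format(question=question, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", verdict)
    return float(match.group(1)) if match else 0.0
```

Because the same model both answers and judges, the quality of the reward signal can improve as the model itself improves across iterations.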

The SRLMs have shown impressive results, with performance improving over three training iterations to reach GPT-4 level instruction following. Starting from a powerful pre-trained model, the team trained it both to carry out tasks and to judge its own outputs, generating additional self-generated training examples at each iteration. After three iterations, the resulting model outperformed other state-of-the-art systems on the AlpacaEval 2.0 benchmark, demonstrating the potential for continuous improvement and, eventually, superhuman capabilities.
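
The loop described above can be summarized in the following sketch. The helper callables (`sample_responses`, `judge`, `dpo_train`) are hypothetical placeholders for the real generation, judging, and DPO training steps; this shows the shape of the iteration, not Meta's implementation.

```python
from typing import Callable, List, Tuple

def self_rewarding_iterations(
    model,
    prompts: List[str],
    sample_responses: Callable,  # (model, prompt, n) -> List[str]
    judge: Callable,             # (model, prompt, response) -> float, e.g. judge_score above
    dpo_train: Callable,         # (model, preference_pairs) -> new model
    num_iterations: int = 3,
    candidates_per_prompt: int = 4,
):
    """One possible shape of the self-rewarding loop: generate candidates,
    score them with the model's own judgments, train on the preferences."""
    for _ in range(num_iterations):
        preference_pairs: List[Tuple[str, str, str]] = []
        for prompt in prompts:
            # 1. The current model proposes several candidate responses.
            candidates = sample_responses(model, prompt, candidates_per_prompt)
            # 2. The same model scores each candidate (LLM-as-a-Judge).
            scored = sorted(((judge(model, prompt, c), c) for c in candidates), reverse=True)
            # 3. Keep (prompt, chosen, rejected) pairs only when the scores differ.
            if scored[0][0] > scored[-1][0]:
                preference_pairs.append((prompt, scored[0][1], scored[-1][1]))
        # 4. Fine-tune the next iteration's model on its own preference pairs with DPO.
        model = dpo_train(model, preference_pairs)
    return model
```

Each pass both sharpens the model's answers and, because the judge is the same model, improves the reward signal used in the next pass.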

Key takeaways:

  • Meta has developed a new paradigm called Self-Rewarding Language Models (SRLMs), in which the model generates its own rewards, enabling continuous improvement in both instruction following and reward modelling.
  • An iterative training procedure based on Direct Preference Optimization (DPO) is used to train SRLMs, allowing the model to keep pushing its own capabilities upward.
  • After just three iterations, SRLMs outperformed other state-of-the-art systems on the AlpacaEval 2.0 benchmark, achieving GPT-4 level performance.
  • SRLMs show potential for continuous self-improvement, paving the way for superhuman agents that keep enhancing their own abilities.