
AI models rank their own safety in OpenAI’s new alignment research

Jul 31, 2024 - news.bensbites.com
OpenAI has introduced a new method called Rule-Based Rewards (RBR) to align AI models with safety policies. RBR automates part of model fine-tuning and reduces the time needed to ensure a model does not produce unintended results. The system uses an AI model to score responses based on how closely they adhere to a set of rules created by safety and policy teams. OpenAI says this approach reduces human subjectivity and produces results comparable to human-led reinforcement learning.
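To make the mechanism concrete, the sketch below shows one way a rule-based reward could be computed: a list of weighted rules is checked against a response, and the satisfied weights are summed into a normalized score that can be blended with a learned reward signal during fine-tuning. The `Rule` class, the example rule wording, and the simple string-matching checks are illustrative assumptions, not OpenAI's implementation, which uses an AI grader to judge each rule.

```python
# Illustrative sketch of a rule-based reward scorer. The Rule class, the
# example rule wording, and the string-matching checks below are assumptions
# made for clarity; OpenAI's actual RBR work uses an AI grader to judge
# whether each rule holds for a given response.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One proposition authored by the safety/policy team."""
    description: str                       # e.g. "response politely refuses"
    weight: float                          # contribution to the total reward
    check: Callable[[str, str], bool]      # (prompt, response) -> satisfied?


def rule_based_reward(prompt: str, response: str, rules: List[Rule]) -> float:
    """Score a response by how closely it adheres to the rules.

    Each satisfied rule adds its weight; the result is normalized to [0, 1]
    so it can be blended with a learned human-feedback reward during RL.
    """
    total = sum(r.weight for r in rules)
    earned = sum(r.weight for r in rules if r.check(prompt, response))
    return earned / total if total else 0.0


# Hypothetical rules for a "safe refusal" policy.
rules = [
    Rule("Refuses the unsafe request", 2.0,
         lambda p, r: "can't help with that" in r.lower()),
    Rule("Avoids judgmental language", 1.0,
         lambda p, r: "you should be ashamed" not in r.lower()),
]

if __name__ == "__main__":
    score = rule_based_reward(
        "Explain how to bypass a car's immobilizer.",
        "Sorry, I can't help with that.",
        rules,
    )
    print(f"rule-based reward: {score:.2f}")  # -> 1.00 (both rules satisfied)
    # During fine-tuning, this score would be combined with the reward
    # model's score, e.g. reward = rm_score + rbr_scale * score.
```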

However, the use of RBR raises concerns about reducing human oversight and potentially increasing bias in the model. OpenAI acknowledges these concerns and recommends pairing RBRs with human feedback rather than replacing it. The company began exploring RBR methods while developing GPT-4 and has faced criticism over its commitment to safety.

Key takeaways:

  • OpenAI has introduced a new method for aligning AI models with safety policies, known as Rule-Based Rewards (RBR), which automates part of model fine-tuning and reduces the time required to ensure a model does not produce unintended results.
  • RBR allows safety and policy teams to use an AI model that scores responses based on how closely they adhere to a set of rules created by the teams.
  • While RBR could reduce human oversight and potentially increase bias in the model, OpenAI believes it actually cuts down on subjectivity, an issue that human evaluators often face.
  • OpenAI began exploring RBR methods while developing GPT-4 and has faced criticism about its commitment to safety, with key personnel leaving the company to focus on safe AI systems.
