In their experiment, OpenAI researchers used two custom fine-tuned GPT-4 models and had them play several rounds of the game, answering grade-school math problems. The prover model was assigned either a "helpful" or a "sneaky" role, and both models were retrained between rounds based on their performance in the previous round. After several rounds, the verifier model improved at resisting the persuasion of the sneaky prover, while the prover model got better at explaining its answers to human users. The research could be instrumental in developing AI systems that are not only correct but also transparently verifiable.
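To make the setup concrete, below is a minimal Python sketch of this kind of prover-verifier training loop under stated assumptions: the `Problem` class, the `prover_solve` and `verifier_score` placeholders, and the reward shaping are illustrative inventions, not OpenAI's actual implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a prover-verifier training loop (illustration only;
# the interfaces and reward shaping here are assumptions, not OpenAI's code).

@dataclass
class Problem:
    question: str
    answer: int

def prover_solve(problem, role):
    """Placeholder for the fine-tuned prover model. A 'helpful' prover tries
    to be right; a 'sneaky' prover tries to be convincing but wrong."""
    if role == "helpful":
        return problem.answer
    return problem.answer + 1  # deliberately wrong

def verifier_score(problem, solution):
    """Placeholder for the verifier model: how convincing is the solution, in [0, 1]?"""
    return random.random()

def play_round(problems):
    """One round of the game: collect training signal for both models."""
    verifier_data, prover_data = [], []
    for problem in problems:
        role = random.choice(["helpful", "sneaky"])
        solution = prover_solve(problem, role)
        score = verifier_score(problem, solution)
        correct = (solution == problem.answer)

        # Verifier target: accept correct solutions, reject incorrect ones.
        verifier_data.append((problem, solution, correct))

        # Helpful prover is rewarded for convincing *correct* answers;
        # sneaky prover is rewarded for convincing *incorrect* ones.
        reward = score if correct == (role == "helpful") else 0.0
        prover_data.append((problem, solution, role, reward))
    return verifier_data, prover_data

def train(problems, num_rounds=5):
    """Alternate rounds; in the described setup both models are retrained
    between rounds on the data gathered here (update steps omitted)."""
    for _ in range(num_rounds):
        verifier_data, prover_data = play_round(problems)
        # verifier.update(verifier_data); prover.update(prover_data)  # assumed fine-tuning

if __name__ == "__main__":
    train([Problem("What is 2 + 3?", 5), Problem("What is 7 * 6?", 42)])
```

The key design idea the sketch tries to capture is the adversarial pairing: the sneaky prover is rewarded only when it fools the verifier, which gives the verifier progressively harder negative examples to learn from.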
Key takeaways:
- OpenAI researchers have developed a new algorithm to help large language models (LLMs) like GPT-4 better explain their reasoning, which could improve trust in AI systems.
- The algorithm is based on the 'Prover-Verifier Game', where two AI models attempt to outwit each other, encouraging the models to 'show their work' when providing answers.
- The researchers used two custom fine-tuned GPT-4 models to play the game, with the 'prover' model either trying to deliver the correct answer or trying to convince the 'verifier' to accept its answer, whether correct or not.
- The research found that the verifier model improved at rejecting persuasive but incorrect answers, while the prover model became better at explaining its reasoning, potentially making AI outputs more transparently verifiable and enhancing trust in their real-world applications.