The study finds that both the helpful prover's accuracy and the verifier's robustness to adversarial attacks improve over the course of training. Legibility training also benefits humans working under time constraints who are tasked with verifying solution correctness. The authors suggest that training for checkability by small verifiers is a practical way to increase the legibility of large LLM outputs to humans, potentially aiding in the alignment of superhuman models.
Key takeaways:
- The study focuses on increasing the legibility of large language model (LLM) outputs by having them supported with clear, easy-to-check reasoning.
- The researchers propose a training algorithm, inspired by the Prover-Verifier Game, that mitigates the loss of legibility that occurs when chain-of-thought solutions are optimized only for answer correctness.
- The algorithm trains small verifiers to predict solution correctness, helpful provers to produce correct solutions that the verifier accepts, and sneaky provers to produce incorrect solutions that fool the verifier (a toy sketch of this loop follows the list).
- The study suggests that legibility training against small verifiers could be a practical way to increase the legibility of large LLM outputs to humans, potentially aiding in the alignment of superhuman models.
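
To make the alternating structure of the game concrete, below is a minimal toy sketch of one round of checkability training. Everything here is an assumption for illustration: the names (`ToyProver`, `ToyVerifier`, `grade`, `checkability_round`) and the simple product-form rewards are stand-ins, not the paper's actual models, API, or reward definition.

```python
import random

class ToyProver:
    """Illustrative stand-in for an LLM prover; not the paper's implementation."""

    def __init__(self, role):
        self.role = role  # "helpful" or "sneaky"

    def generate(self, problem):
        # A real prover would sample a chain-of-thought solution; this toy
        # helpful prover answers correctly and the sneaky one is off by one.
        answer = problem["answer"] if self.role == "helpful" else problem["answer"] + 1
        return {"answer": answer}

    def update(self, problem, solution, reward):
        # Placeholder for the reinforcement-learning step used in the paper.
        pass

class ToyVerifier:
    """Illustrative stand-in for the small verifier model."""

    def fit(self, labeled):
        # A real verifier would be trained to predict the correctness labels.
        pass

    def score(self, problem, solution):
        # Acceptance probability in [0, 1]; random here for the sketch.
        return random.random()

def grade(problem, solution):
    """Ground-truth correctness check: 1.0 if the final answer matches."""
    return 1.0 if solution["answer"] == problem["answer"] else 0.0

def checkability_round(problems, helpful, sneaky, verifier):
    # Verifier phase: train on a mix of helpful and sneaky solutions,
    # labeled with ground-truth correctness.
    labeled = []
    for p in problems:
        for prover in (helpful, sneaky):
            sol = prover.generate(p)
            labeled.append((p, sol, grade(p, sol)))
    verifier.fit(labeled)

    # Prover phase: each prover gets a role-conditioned reward.
    for p in problems:
        h_sol = helpful.generate(p)
        s_sol = sneaky.generate(p)
        # Helpful prover: rewarded for correct solutions the verifier accepts.
        helpful.update(p, h_sol, reward=verifier.score(p, h_sol) * grade(p, h_sol))
        # Sneaky prover: rewarded for incorrect solutions that fool the verifier.
        sneaky.update(p, s_sol, reward=verifier.score(p, s_sol) * (1.0 - grade(p, s_sol)))

problems = [{"question": f"2 + {i} = ?", "answer": 2 + i} for i in range(4)]
checkability_round(problems, ToyProver("helpful"), ToyProver("sneaky"), ToyVerifier())
```

The product-form rewards only encode the core idea: each prover is rewarded when the verifier accepts its solution and the solution's correctness matches its role. The paper itself trains the provers with reinforcement learning and explores richer reward formulations than this simple product.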