The study finds that both the helpful prover's accuracy and the verifier's robustness to adversarial attacks improve over the course of training. Legibility training also benefits humans working under time constraints who are tasked with verifying solution correctness. The authors suggest that training for checkability by small verifiers is a practical way to increase the legibility of large LLM outputs to humans, potentially aiding in the alignment of superhuman models.
Key takeaways:
- The study focuses on increasing the legibility of large language model (LLM) outputs by having them supported with clear, easy-to-check reasoning.
- The researchers propose a training algorithm, inspired by the Prover-Verifier Game, that mitigates the loss of legibility that occurs when chain-of-thought solutions are optimized only for answer correctness.
- The algorithm trains small verifiers to predict solution correctness, helpful provers to produce correct solutions that the verifier accepts, and sneaky provers to produce incorrect solutions that fool the verifier (a toy sketch of this loop follows the list).
- The study suggests that legibility training against small verifiers could be a practical way to increase the legibility of large LLM outputs to humans, potentially aiding in the alignment of superhuman models.
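
To make the alternating structure of the game concrete, below is a minimal toy sketch of one round of checkability training. Everything here is an assumption for illustration: the names (`ToyProver`, `ToyVerifier`, `grade`, `checkability_round`) and the simple product-form rewards are stand-ins, not the paper's actual models, API, or reward definition.

```python
import random

class ToyProver:
    """Illustrative stand-in for an LLM prover; not the paper's implementation."""

    def __init__(self, role):
        self.role = role  # "helpful" or "sneaky"

    def generate(self, problem):
        # A real prover would sample a chain-of-thought solution; this toy
        # helpful prover answers correctly and the sneaky one is off by one.
        answer = problem["answer"] if self.role == "helpful" else problem["answer"] + 1
        return {"answer": answer}

    def update(self, problem, solution, reward):
        # Placeholder for the reinforcement-learning step used in the paper.
        pass

class ToyVerifier:
    """Illustrative stand-in for the small verifier model."""

    def fit(self, labeled):
        # A real verifier would be trained to predict the correctness labels.
        pass

    def score(self, problem, solution):
        # Acceptance probability in [0, 1]; random here for the sketch.
        return random.random()

def grade(problem, solution):
    """Ground-truth correctness check: 1.0 if the final answer matches."""
    return 1.0 if solution["answer"] == problem["answer"] else 0.0

def checkability_round(problems, helpful, sneaky, verifier):
    # Verifier phase: train on a mix of helpful and sneaky solutions,
    # labeled with ground-truth correctness.
    labeled = []
    for p in problems:
        for prover in (helpful, sneaky):
            sol = prover.generate(p)
            labeled.append((p, sol, grade(p, sol)))
    verifier.fit(labeled)

    # Prover phase: each prover gets a role-conditioned reward.
    for p in problems:
        h_sol = helpful.generate(p)
        s_sol = sneaky.generate(p)
        # Helpful prover: rewarded for correct solutions the verifier accepts.
        helpful.update(p, h_sol, reward=verifier.score(p, h_sol) * grade(p, h_sol))
        # Sneaky prover: rewarded for incorrect solutions that fool the verifier.
        sneaky.update(p, s_sol, reward=verifier.score(p, s_sol) * (1.0 - grade(p, s_sol)))

problems = [{"question": f"2 + {i} = ?", "answer": 2 + i} for i in range(4)]
checkability_round(problems, ToyProver("helpful"), ToyProver("sneaky"), ToyVerifier())
```

The product-form rewards only encode the core idea: each prover is rewarded when the verifier accepts its solution and the solution's correctness matches its role. The paper itself trains the provers with reinforcement learning and explores richer reward formulations than this simple product.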