The author then attempts to calculate a more accurate percentile for GPT-4's performance, using publicly available data and making certain assumptions about the distribution of scores. The findings suggest that GPT-4's performance might be lower than the 90th percentile, particularly on the essay component of the exam. The paper concludes by emphasizing the need for rigorous and transparent evaluations of AI capabilities, so that AI systems can be used safely and effectively.
Key takeaways:
- OpenAI's GPT-4, launched in March 2023, was reported to have scored in the 90th percentile on the Uniform Bar Examination, a claim that was widely publicized.
- However, the paper suggests that this estimate may be inflated, particularly if it is taken to reflect how GPT-4 compares with a practicing lawyer.
- The paper investigates the methodological challenges in verifying the claim. Measured against first-time test takers, it estimates GPT-4's performance at around the 62nd percentile overall and around the 42nd percentile on the essays.
- The paper emphasizes the importance of rigorous and transparent capabilities evaluations for generative AI developers to ensure safer and more trustworthy AI.
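
The percentile estimates above depend on which group of test takers is used for comparison and on what is assumed about the distribution of scores. The sketch below illustrates that dependence only. It assumes scores are normally distributed, and every number in it (the score, the means, the standard deviation) is a made-up placeholder rather than a figure from the paper or from any exam administration. The function name `percentile_from_score` is likewise hypothetical.

```python
from statistics import NormalDist


def percentile_from_score(score: float, mean: float, sd: float) -> float:
    """Percentile rank of `score`, ASSUMING scores follow a normal
    distribution with the given mean and standard deviation."""
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(score)


# Placeholder values for illustration only; none of these come from the paper.
same_score = 150.0

# Hypothetical comparison group with a lower average score.
weaker_group = percentile_from_score(same_score, mean=140.0, sd=10.0)

# Hypothetical comparison group with a higher average score.
stronger_group = percentile_from_score(same_score, mean=148.0, sd=10.0)

print(f"vs. lower-scoring group:  {weaker_group:.1f}th percentile")
print(f"vs. higher-scoring group: {stronger_group:.1f}th percentile")
```

With these placeholder inputs, the same raw score ranks noticeably lower against the higher-scoring group. This is the general reason a percentile claim is hard to interpret unless the comparison population is stated.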