Re-evaluating GPT-4’s bar exam performance

The article examines the reported performance of OpenAI's GPT-4 on the Uniform Bar Examination, where it was said to have scored in the 90th percentile. The author questions the validity of this claim, citing a lack of transparency in how the percentile was calculated and potential bias in the underlying data. The author suggests that the percentile is likely overestimated, especially if it is meant as a comparison with practicing lawyers. The article also highlights the importance of accurate and transparent capability evaluations for AI systems, both for their safe deployment and for their use in professional fields like law.

The author then attempts to calculate a more accurate percentile for GPT-4's performance, using publicly available data and making certain assumptions about the distribution of scores. The findings suggest that GPT-4's performance might be lower than the 90th percentile, particularly on the essay component of the exam. The article concludes by emphasizing the need for rigorous and transparent evaluations of AI capabilities, to ensure their safe and effective use.
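
As a rough illustration of why the choice of comparison group matters, the following sketch converts one score into a percentile under two assumed normal score distributions. It is not the author's method, and every number in it is a hypothetical placeholder rather than actual bar exam data; the function name `percentile_under_normal` is likewise made up for this example.

```python
from statistics import NormalDist


def percentile_under_normal(score: float, mean: float, sd: float) -> float:
    """Return the percentage of a normally distributed population scoring below `score`."""
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(score)


# Hypothetical placeholder values, NOT real exam statistics.
example_score = 300.0

# Same score, two assumed reference populations with different means.
all_takers = percentile_under_normal(example_score, mean=280.0, sd=16.0)
first_timers = percentile_under_normal(example_score, mean=290.0, sd=16.0)

print(f"Percentile vs. assumed all-takers distribution:  {all_takers:.1f}")
print(f"Percentile vs. assumed first-timer distribution: {first_timers:.1f}")
```

Under these assumptions, the same score lands at a lower percentile when the reference population has a higher mean, which is the general effect the article describes when it compares GPT-4 against different groups of test takers.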

Key takeaways:

OpenAI's GPT-4, launched in March 2023, was reported to have scored in the 90th percentile on the Uniform Bar Examination, a claim that was widely publicized.
However, the paper suggests that this estimate may be inflated, particularly if it is meant to reflect the actual capabilities of a practicing lawyer.
The paper investigates the methodological challenges in verifying the claim and estimates that, against first-time test takers, GPT-4 performed at around the 62nd percentile overall and around the 42nd percentile on the essays.
The paper emphasizes the importance of rigorous and transparent capabilities evaluations for generative AI developers to ensure safer and more trustworthy AI.