The author uses the example of a test in which unanswered questions previously carried no penalty; forcing test-takers to answer every question would likely produce more incorrect responses. The debate is whether to penalize AI for refusing to answer a question or to keep the earlier convention of treating refusals as neutral. The article concludes that we need to improve how we gauge progress in AI: the measurements themselves, how we devise them, how we apply them, and how we convey the results to insiders and the public.
Key takeaways:
- As generative AI and large language models are built to be bigger and better, they also appear to be becoming less reliable, though this may be a matter of accounting trickery and fanciful statistics rather than an actual decline in the AI itself.
- Reliability in AI pertains to the consistency of correctness. If an AI is not consistently correct, users will get upset and stop using it, which hurts the AI maker's bottom line.
- Generative AI responses can be graded on correctness into three categories: correct answer, incorrect answer, and avoided answering (a refusal). The debate lies in how to score the instances where the AI avoids answering.
- A recent research study found that the conclusion that generative AI is becoming less reliable hinges significantly on how you decide to score the AI. If the AI is pushed to answer nearly every question and refuses only sparingly, the percentage of incorrect answers will likely be higher than it was before (see the sketch after this list).
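To make the scoring debate concrete, here is a minimal sketch in Python, using made-up tallies and hypothetical helper names (none of this comes from the article or the cited study). It contrasts two conventions: counting accuracy only over attempted answers versus counting incorrect answers as a share of every question asked.

```python
from dataclasses import dataclass

@dataclass
class Tally:
    correct: int
    incorrect: int
    avoided: int  # questions the AI declined to answer

def accuracy_on_attempted(t: Tally) -> float:
    """Older convention: refusals are neutral, only attempted answers count."""
    attempted = t.correct + t.incorrect
    return t.correct / attempted

def incorrect_share_of_all(t: Tally) -> float:
    """Stricter view: incorrect answers as a share of every question asked."""
    total = t.correct + t.incorrect + t.avoided
    return t.incorrect / total

# Hypothetical tallies for illustration only.
older_model = Tally(correct=60, incorrect=10, avoided=30)  # refuses often
newer_model = Tally(correct=75, incorrect=23, avoided=2)   # rarely refuses

for name, t in [("older model", older_model), ("newer model", newer_model)]:
    print(f"{name}: accuracy on attempted = {accuracy_on_attempted(t):.0%}, "
          f"incorrect share of all questions = {incorrect_share_of_all(t):.0%}")
```

With these invented numbers, the newer model answers more questions correctly in absolute terms (75 versus 60), yet because it almost never refuses, both its accuracy on attempted answers and its share of incorrect answers look worse. That is the kind of scoring sensitivity the takeaway above describes: whether the AI appears less reliable depends on how refusals are folded into the metric.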