The authors found that a PoLL, composed of a larger number of smaller models, outperforms a single large judge on evaluation quality. Because the panel draws on disjoint model families, it also exhibits less intra-model bias. Across three distinct judge settings and six different datasets, the study showed that the PoLL method is over seven times less expensive than relying on a single large judge.
Key takeaways:
- Large Language Models (LLMs) have become so advanced that accurately evaluating their quality is challenging.
- Many evaluations now use LLMs themselves as judges to score the quality of outputs from other LLMs, which can be costly and can introduce intra-model bias.
- The authors propose evaluating models using a Panel of LLM evaluators (PoLL), which they find to be more effective and less biased (see the aggregation sketch after this list).
- The PoLL method is over seven times less expensive than using a single large model for evaluation.
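To make the panel idea concrete, here is a minimal sketch of PoLL-style aggregation in Python. The judge functions, their names, and the example verdicts are hypothetical placeholders standing in for API calls to small models from disjoint families; the paper aggregates individual judgments by max voting (for binary verdicts) or average pooling (for scalar scores), and this sketch shows the average-pooling case.

```python
from statistics import mean
from typing import Callable, Sequence

# A judge maps (question, candidate answer) to a score.
Judge = Callable[[str, str], float]

def poll_score(question: str, answer: str, judges: Sequence[Judge]) -> float:
    """Average-pool the independent verdicts from a panel of judges."""
    return mean(judge(question, answer) for judge in judges)

# Hypothetical stub judges; in practice each would wrap a call to a small
# model from a different family so that their biases do not overlap.
def judge_a(question: str, answer: str) -> float:
    return 1.0  # placeholder verdict

def judge_b(question: str, answer: str) -> float:
    return 0.0  # placeholder verdict

def judge_c(question: str, answer: str) -> float:
    return 1.0  # placeholder verdict

score = poll_score("Who wrote Hamlet?", "William Shakespeare",
                   [judge_a, judge_b, judge_c])
print(f"PoLL score: {score:.2f}")  # -> PoLL score: 0.67
```

Because each verdict comes from a different model family, no single judge's preferences dominate the pooled score, which is the intuition behind the reduced intra-model bias.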