Answer Matching Outperforms Multiple Choice for Language Model Evaluation

July 3, 2025
Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
cs.AI

Abstract

Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show that multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but we show that this has changed. We consider generative evaluation via what we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
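
The abstract describes answer matching as a two-step protocol. Below is a minimal Python sketch of that protocol, assuming a hypothetical `query_model` helper standing in for whatever LLM API is used; the model names and the matching prompt are illustrative and not taken from the paper.

```python
# Minimal sketch of the answer-matching protocol described in the abstract.
# `query_model` is a hypothetical stand-in for an LLM API call; wire it to
# your own provider. Model names and prompt wording are illustrative.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model`, return its text reply."""
    raise NotImplementedError("connect this to your LLM provider")

def answer_matching(question: str, reference_answer: str,
                    candidate: str = "candidate-model",
                    matcher: str = "small-recent-model") -> bool:
    # Step 1: the candidate model sees the question WITHOUT the
    # multiple-choice options and produces a free-form response.
    response = query_model(candidate, question)

    # Step 2: a (possibly small) modern language model compares the
    # free-form response against the reference answer and returns a
    # binary verdict.
    verdict = query_model(
        matcher,
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model response: {response}\n"
        "Does the response match the reference answer? Reply Yes or No.",
    )
    return verdict.strip().lower().startswith("yes")
```

Note that the matcher is given the reference answer, which is what distinguishes answer matching from the reference-free LLM-as-a-judge setup the paper finds to align poorly with human grading.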