

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

April 10, 2026
Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo
cs.AI

Abstract

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.
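To illustrate the failure mode the abstract describes, here is a minimal sketch (hypothetical helper names, not the paper's code) of how a rigid lexical matcher scores a semantically correct answer as wrong simply because its phrasing deviates from the expected format:

```python
import re

def lexical_match(candidate: str, reference: str) -> bool:
    """Rigid lexical evaluation: normalize case/whitespace and require
    an exact string match against the reference answer."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(candidate) == norm(reference)

reference = "42"
# A correct answer phrased naturally fails the lexical check, conflating
# problem-solving ability with compliance to the expected output format.
print(lexical_match("The answer is 42.", reference))  # False
# Only an answer in exactly the expected format passes.
print(lexical_match("42", reference))                 # True
```

A reference-based judge (whether an LLM or the paper's encoder trained on question-candidate-reference triplets) instead scores the semantic agreement of the candidate with the reference, so both answers above would be accepted.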