BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
April 10, 2026
Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo
cs.AI
Abstract
Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.
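As a rough illustration of the setup the abstract describes, the sketch below packs a question-candidate-reference triplet into a single sequence, as a BERT-style cross-encoder judge might consume it. The packing function, separator convention, and the commented-out classifier usage are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (assumed format, not the paper's exact implementation):
# each evaluation example is a (question, candidate, reference) triplet
# packed into one sequence for a BERT-style cross-encoder that outputs a
# binary correctness verdict.

def pack_triplet(question: str, candidate: str, reference: str,
                 sep: str = " [SEP] ") -> str:
    """Join a question-candidate-reference triplet into one encoder input."""
    return sep.join([question, candidate, reference])

# Hypothetical downstream usage with a fine-tuned encoder classifier
# (model name and pipeline call are placeholders):
# from transformers import pipeline
# judge = pipeline("text-classification", model="path/to/bert-judge")
# verdict = judge(pack_triplet(q, cand, ref))

example = pack_triplet(
    "What is the capital of France?",
    "The capital city of France is Paris.",  # model's free-form answer
    "Paris",                                 # gold reference
)
print(example)
```

Because the judge scores semantic agreement between the candidate and the reference in context, it tolerates phrasing differences (e.g. "The capital city of France is Paris." vs. "Paris") that rigid lexical matching would penalize.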