On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
September 4, 2025
Authors: Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
cs.AI
Abstract
The effectiveness of Large Language Models (LLMs) is usually evaluated by means of
benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in
their original wording and thus in a fixed, standardized format. However,
real-world applications involve linguistic variability, requiring models to
maintain their effectiveness across diverse rewordings of the same question or
query. In this study, we systematically assess the robustness of LLMs to
paraphrased benchmark questions and investigate whether benchmark-based
evaluations provide a reliable measure of model capabilities. We systematically
generate multiple paraphrases of all the questions across six common
benchmarks and measure the resulting variations in the effectiveness of 34
state-of-the-art LLMs of different sizes and levels of effectiveness. Our findings reveal
that while LLM rankings remain relatively stable across paraphrased inputs,
absolute effectiveness scores change and decline significantly. This suggests
that LLMs struggle with linguistic variability, raising concerns about their
generalization abilities and evaluation methodologies. Furthermore, the
observed performance drop challenges the reliability of benchmark-based
evaluations, indicating that high benchmark scores may not fully capture a
model's robustness to real-world input variations. We discuss the implications
of these findings for LLM evaluation methodologies, emphasizing the need for
robustness-aware benchmarks that better reflect practical deployment scenarios.
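
To make the evaluation idea concrete, the sketch below shows one way to quantify the two effects the abstract describes: the absolute effectiveness drop between original and paraphrased benchmark questions, and the stability of the model ranking. This is an illustrative sketch, not the authors' pipeline; the per-model accuracy values and the choice of Kendall's tau as the rank-stability measure are assumptions for the example.

```python
# Minimal sketch (assumed setup, not the paper's code): given per-model accuracy
# on the original and paraphrased versions of a benchmark, report the absolute
# effectiveness drop and the rank correlation between the two score-induced rankings.
from scipy.stats import kendalltau

# Hypothetical per-model accuracies; real values would come from running each
# model on the original and on the paraphrased question sets.
original = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.69}
paraphrased = {"model_a": 0.76, "model_b": 0.70, "model_c": 0.61}

models = sorted(original)
orig_scores = [original[m] for m in models]
para_scores = [paraphrased[m] for m in models]

# Absolute effectiveness drop per model and on average.
drops = {m: original[m] - paraphrased[m] for m in models}
mean_drop = sum(drops.values()) / len(drops)

# Ranking stability: Kendall's tau between the rankings induced by the two score lists.
tau, p_value = kendalltau(orig_scores, para_scores)

print(f"Mean accuracy drop: {mean_drop:.3f}")
print(f"Kendall's tau between rankings: {tau:.2f} (p={p_value:.3f})")
```

Under this framing, a high tau with a large mean drop would match the abstract's finding: rankings stay roughly stable while absolute scores fall on paraphrased inputs.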