
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

September 4, 2025
Authors: Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
cs.AI

Abstract

The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks and measure the resulting variations in the effectiveness of 34 state-of-the-art LLMs of varying size and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
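
The evaluation protocol described in the abstract (scoring each model on the original benchmark questions and on machine-generated paraphrases, then comparing absolute scores and rankings) can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: `paraphrase` and `answer` are hypothetical placeholders for the paraphrasing step and model inference, and Kendall's tau is used here as one possible way to quantify ranking stability.

```python
# Minimal sketch of a paraphrase-robustness check for benchmark-based evaluation.
# `paraphrase(text)` and `answer(model, text)` are hypothetical stand-ins for an
# LLM-based paraphrasing step and model inference; only the scoring logic is concrete.
from statistics import mean
from scipy.stats import kendalltau

def accuracy(model, questions):
    """Fraction of questions the model answers correctly."""
    return mean(1.0 if answer(model, q["text"]) == q["gold"] else 0.0
                for q in questions)

def robustness_report(models, questions, n_paraphrases=5):
    # Effectiveness on the original, fixed-wording benchmark questions.
    original = {m: accuracy(m, questions) for m in models}

    # Effectiveness on several reworded versions of every question.
    reworded = [
        {**q, "text": paraphrase(q["text"])}
        for q in questions
        for _ in range(n_paraphrases)
    ]
    paraphrased = {m: accuracy(m, reworded) for m in models}

    # Absolute drop per model, plus rank correlation between the two conditions:
    # a high tau with large drops matches the paper's finding of stable rankings
    # but significantly lower absolute scores.
    drops = {m: original[m] - paraphrased[m] for m in models}
    tau, _ = kendalltau([original[m] for m in models],
                        [paraphrased[m] for m in models])
    return drops, tau
```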