On Robustness and Reliability of Benchmark-Based Evaluation of LLMs
September 4, 2025
Authors: Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero
cs.AI
Abstract
The effectiveness of Large Language Models (LLMs) is usually evaluated by means of
benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in
their original wording and thus in a fixed, standardized format. However,
real-world applications involve linguistic variability, requiring models to
maintain their effectiveness across diverse rewordings of the same question or
query. In this study, we systematically assess the robustness of LLMs to
paraphrased benchmark questions and investigate whether benchmark-based
evaluations provide a reliable measure of model capabilities. We systematically
generate multiple paraphrases of all the questions across six common
benchmarks and measure the resulting variations in the effectiveness of 34
state-of-the-art LLMs of different sizes and levels of effectiveness. Our findings reveal
that while LLM rankings remain relatively stable across paraphrased inputs,
absolute effectiveness scores change and decline significantly. This suggests
that LLMs struggle with linguistic variability, raising concerns about their
generalization abilities and evaluation methodologies. Furthermore, the
observed performance drop challenges the reliability of benchmark-based
evaluations, indicating that high benchmark scores may not fully capture a
model's robustness to real-world input variations. We discuss the implications
of these findings for LLM evaluation methodologies, emphasizing the need for
robustness-aware benchmarks that better reflect practical deployment scenarios.
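
To make the evaluation idea concrete, the sketch below shows one way to quantify the two effects the abstract describes: the absolute effectiveness drop between original and paraphrased benchmark questions, and the stability of the model ranking. This is an illustrative sketch, not the authors' pipeline; the per-model accuracy values and the choice of Kendall's tau as the rank-stability measure are assumptions for the example.

```python
# Minimal sketch (assumed setup, not the paper's code): given per-model accuracy
# on the original and paraphrased versions of a benchmark, report the absolute
# effectiveness drop and the rank correlation between the two score-induced rankings.
from scipy.stats import kendalltau

# Hypothetical per-model accuracies; real values would come from running each
# model on the original and on the paraphrased question sets.
original = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.69}
paraphrased = {"model_a": 0.76, "model_b": 0.70, "model_c": 0.61}

models = sorted(original)
orig_scores = [original[m] for m in models]
para_scores = [paraphrased[m] for m in models]

# Absolute effectiveness drop per model and on average.
drops = {m: original[m] - paraphrased[m] for m in models}
mean_drop = sum(drops.values()) / len(drops)

# Ranking stability: Kendall's tau between the rankings induced by the two score lists.
tau, p_value = kendalltau(orig_scores, para_scores)

print(f"Mean accuracy drop: {mean_drop:.3f}")
print(f"Kendall's tau between rankings: {tau:.2f} (p={p_value:.3f})")
```

Under this framing, a high tau with a large mean drop would match the abstract's finding: rankings stay roughly stable while absolute scores fall on paraphrased inputs.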