
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

September 1, 2025
Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
cs.AI

Abstract

Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., the GPT and Gemini families) on 6 benchmarks, spanning both multiple-choice and open-ended tasks, with 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluation, we observe a substantial reduction in performance variance and consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
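
To make the contrast concrete, here is a minimal illustrative sketch (not code from the paper; all function names and data are hypothetical) of how rigid exact-match scoring can mark a semantically correct paraphrase as wrong, while a more forgiving normalization-and-alias matcher accepts it.

```python
# Illustrative sketch only: contrasts rigid answer matching with a more
# forgiving matcher. Not the paper's code; names and data are hypothetical.

def _normalize(s: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period."""
    return s.strip().lower().rstrip(".")


def rigid_match(prediction: str, gold: str) -> bool:
    """Strict answer matching: only an exact string match counts as correct."""
    return prediction.strip() == gold.strip()


def lenient_match(prediction: str, gold: str, aliases: dict) -> bool:
    """Accept the gold answer after normalization, or any listed paraphrase of it."""
    pred, ref = _normalize(prediction), _normalize(gold)
    return pred == ref or pred in aliases.get(ref, set())


if __name__ == "__main__":
    gold = "Paris"
    # Hypothetical model outputs for "What is the capital of France?"
    outputs = ["Paris", "paris.", "The capital of France is Paris"]
    aliases = {"paris": {"the capital of france is paris"}}

    for out in outputs:
        print(f"{out!r}: rigid={rigid_match(out, gold)}, "
              f"lenient={lenient_match(out, gold, aliases)}")
```

An LLM-as-a-Judge evaluator, as used in the paper, goes further than a fixed alias table: instead of string rules, a judge model is asked whether the prediction and the reference answer convey the same meaning.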