Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
September 1, 2025
Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
cs.AI
Abstract
Prompt sensitivity, the phenomenon where paraphrasing a prompt (i.e.,
rewording it while preserving its meaning) leads to significant changes in
large language model (LLM) performance, has been widely
accepted as a core limitation of LLMs. In this work, we revisit this issue and
ask: Is the widely reported high prompt sensitivity truly an inherent weakness
of LLMs, or is it largely an artifact of evaluation processes? To answer this
question, we systematically evaluate 7 LLMs (e.g., the GPT and Gemini families)
across 6 benchmarks, spanning both multiple-choice and open-ended tasks, using 12
diverse prompt templates. We find that much of the prompt sensitivity stems
from heuristic evaluation methods, including log-likelihood scoring and rigid
answer matching, which often overlook semantically correct responses expressed
through alternative phrasings, such as synonyms or paraphrases. When we adopt
LLM-as-a-Judge evaluations, we observe a substantial reduction in performance
variance and a consistently higher correlation in model rankings across
prompts. Our findings suggest that modern LLMs are more robust to prompt
templates than previously believed, and that prompt sensitivity may be more an
artifact of evaluation than a flaw in the models.
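
To make the distinction concrete, below is a minimal, illustrative sketch (not the authors' evaluation code) of how rigid exact-match scoring can register large accuracy swings across prompt templates while a judge-style semantic check scores the same responses consistently. The template names, predictions, and helper functions are hypothetical, and a toy synonym table stands in for the actual LLM judge so the example stays self-contained.

```python
# Toy contrast between rigid answer matching and a judge-style semantic check.
# All data and names below are illustrative, not taken from the paper.
from statistics import mean, pstdev


def exact_match(prediction: str, gold: str) -> bool:
    """Rigid answer matching: credit only a literal (case-insensitive) match."""
    return prediction.strip().lower() == gold.strip().lower()


def judge_match(prediction: str, gold: str) -> bool:
    """Stand-in for an LLM-as-a-Judge call.

    A real judge would prompt a strong LLM to decide semantic equivalence;
    here a tiny synonym table plays that role so the sketch stays runnable.
    """
    equivalents = {"car": {"car", "automobile", "a car", "the answer is car"}}
    gold_key = gold.strip().lower()
    return prediction.strip().lower() in equivalents.get(gold_key, {gold_key})


# Toy predictions from one model on a single question under four hypothetical
# prompt templates; every answer is semantically correct but phrased differently.
gold = "car"
predictions_by_template = {
    "template_1": "car",
    "template_2": "Automobile",
    "template_3": "a car",
    "template_4": "The answer is car",
}

for name, scorer in [("exact match", exact_match), ("LLM judge", judge_match)]:
    accuracies = [float(scorer(pred, gold)) for pred in predictions_by_template.values()]
    # The spread of accuracy across templates is the quantity the abstract calls
    # prompt sensitivity; the judge-style scorer collapses it to zero on this toy data.
    print(f"{name:>11}: mean acc = {mean(accuracies):.2f}, "
          f"std across templates = {pstdev(accuracies):.2f}")
```

In the paper's setting the judge is itself an LLM asked to decide semantic equivalence; the synonym lookup above merely keeps the sketch runnable while showing why strict matching alone can inflate apparent prompt sensitivity.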