Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
September 1, 2025
Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
cs.AI
Abstract
Prompt sensitivity, the phenomenon where paraphrasing a prompt (i.e.,
rewording it while preserving its meaning) leads to significant changes in
large language model (LLM) performance, has been widely
accepted as a core limitation of LLMs. In this work, we revisit this issue and
ask: Is the widely reported high prompt sensitivity truly an inherent weakness
of LLMs, or is it largely an artifact of evaluation processes? To answer this
question, we systematically evaluate 7 LLMs (e.g., the GPT and Gemini families)
across 6 benchmarks, spanning both multiple-choice and open-ended tasks, using 12
diverse prompt templates. We find that much of the prompt sensitivity stems
from heuristic evaluation methods, including log-likelihood scoring and rigid
answer matching, which often overlook semantically correct responses expressed
through alternative phrasings, such as synonyms or paraphrases. When we adopt
LLM-as-a-Judge evaluations, we observe a substantial reduction in performance
variance and a consistently higher correlation in model rankings across
prompts. Our findings suggest that modern LLMs are more robust to prompt
templates than previously believed, and that prompt sensitivity may be more an
artifact of evaluation than a flaw in the models.
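
To make the distinction concrete, below is a minimal, illustrative sketch (not the authors' evaluation code) of how rigid exact-match scoring can register large accuracy swings across prompt templates while a judge-style semantic check scores the same responses consistently. The template names, predictions, and helper functions are hypothetical, and a toy synonym table stands in for the actual LLM judge so the example stays self-contained.

```python
# Toy contrast between rigid answer matching and a judge-style semantic check.
# All data and names below are illustrative, not taken from the paper.
from statistics import mean, pstdev


def exact_match(prediction: str, gold: str) -> bool:
    """Rigid answer matching: credit only a literal (case-insensitive) match."""
    return prediction.strip().lower() == gold.strip().lower()


def judge_match(prediction: str, gold: str) -> bool:
    """Stand-in for an LLM-as-a-Judge call.

    A real judge would prompt a strong LLM to decide semantic equivalence;
    here a tiny synonym table plays that role so the sketch stays runnable.
    """
    equivalents = {"car": {"car", "automobile", "a car", "the answer is car"}}
    gold_key = gold.strip().lower()
    return prediction.strip().lower() in equivalents.get(gold_key, {gold_key})


# Toy predictions from one model on a single question under four hypothetical
# prompt templates; every answer is semantically correct but phrased differently.
gold = "car"
predictions_by_template = {
    "template_1": "car",
    "template_2": "Automobile",
    "template_3": "a car",
    "template_4": "The answer is car",
}

for name, scorer in [("exact match", exact_match), ("LLM judge", judge_match)]:
    accuracies = [float(scorer(pred, gold)) for pred in predictions_by_template.values()]
    # The spread of accuracy across templates is the quantity the abstract calls
    # prompt sensitivity; the judge-style scorer collapses it to zero on this toy data.
    print(f"{name:>11}: mean acc = {mean(accuracies):.2f}, "
          f"std across templates = {pstdev(accuracies):.2f}")
```

In the paper's setting the judge is itself an LLM asked to decide semantic equivalence; the synonym lookup above merely keeps the sketch runnable while showing why strict matching alone can inflate apparent prompt sensitivity.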