
A Single Character can Make or Break Your LLM Evals

October 2, 2025
作者: Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim
cs.AI

Abstract

Common Large Language Model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format those examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: a comma? A new line? A semicolon? A hashtag? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find that LLMs' brittleness pervades topics and model families, and doesn't improve with scale. By probing attention-head scores, we find that well-performing delimiters steer attention toward key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
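To make the delimiter choice concrete, here is a minimal sketch (a hypothetical helper, not the paper's actual evaluation code) of how in-context demonstration examples are typically concatenated into a few-shot prompt, with the separator exposed as a parameter. Swapping a newline for a semicolon changes exactly one character between examples, which is the kind of variation the abstract reports as shifting MMLU accuracy by up to ±23%.

```python
def build_prompt(examples, question, delimiter="\n"):
    """Join (question, answer) demonstration pairs with `delimiter`,
    then append the test question in the same format."""
    demos = delimiter.join(f"Q: {q} A: {a}" for q, a in examples)
    return f"{demos}{delimiter}Q: {question} A:"

# Two toy demonstrations; only the separator between them differs below.
examples = [("2+2?", "4"), ("3+3?", "6")]
p_newline = build_prompt(examples, "4+4?", delimiter="\n")
p_semicolon = build_prompt(examples, "4+4?", delimiter="; ")
```

Here `p_newline` and `p_semicolon` carry identical demonstrations and differ only in the separating characters, yet per the abstract such prompts can elicit markedly different response quality from the same model.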