A Single Character can Make or Break Your LLM Evals

Abstract

Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format those examples has received less attention. In evaluation protocols and real-world usage, users must choose how to separate in-context examples: with a comma, a newline, a semicolon, a hashtag, and so on. Surprisingly, we find that this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character that separates examples. We find that this brittleness pervades topics and model families and does not improve with scale. By probing attention head scores, we find that well-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for which delimiters perform best.
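To make the setup concrete, the sketch below shows how the same few-shot prompt changes when only the character separating the examples is swapped. It is a minimal illustration, not the authors' evaluation harness: the helper name build_prompt, the "Q: ... A: ..." example format, and the wording of the announcement line are all assumptions introduced here. The abstract states only that specifying the selected delimiter in the prompt improves robustness.

```python
# Illustrative sketch only; names and prompt format are hypothetical,
# not taken from the paper's evaluation code.
DELIMITERS = {"newline": "\n", "comma": ",", "semicolon": ";", "hashtag": "#"}


def build_prompt(examples, question, delimiter, announce=False):
    """Join few-shot examples and the final question with one delimiter."""
    shots = [f"Q: {q} A: {a}" for q, a in examples]
    body = delimiter.join(shots + [f"Q: {question} A:"])
    if announce:
        # Assumed wording: tell the model which delimiter separates examples.
        return f"Examples are separated by {delimiter!r}.\n{body}"
    return body


examples = [("2+2?", "4"), ("Capital of France?", "Paris")]

# One prompt per delimiter; everything except the separator is identical.
prompts = {
    name: build_prompt(examples, "3+5?", delim)
    for name, delim in DELIMITERS.items()
}
```

Each entry in prompts would be scored against the same model and benchmark, so any difference in accuracy can be attributed to the separator alone. Passing announce=True produces the variant in which the delimiter is named in the prompt.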
