The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Abstract

Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, even though this ability is foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, systematically varying the format of option labels (alphabetic, numeric, Roman numeral) while keeping their meaning identical, under four paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman numeral vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fall below random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
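The label manipulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the prompt template, the instruction wording, and the names `make_labels`, `build_prompt`, and `random_baseline` are all hypothetical, and the way option contents are "removed" here (labels printed without text) is an assumption about paradigm (3).

```python
import string
from typing import List

# Roman numerals for up to 10 options (MMLU uses 4 options, MMLU-Pro up to 10).
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def make_labels(scheme: str, n: int) -> List[str]:
    """Return n option labels in the requested format (hypothetical helper)."""
    if not 1 <= n <= 10:
        raise ValueError("this sketch supports 1 to 10 options")
    if scheme == "alphabetic":
        return list(string.ascii_uppercase[:n])
    if scheme == "numeric":
        return [str(i) for i in range(1, n + 1)]
    if scheme == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label scheme: {scheme}")


def build_prompt(
    question: str,
    options: List[str],
    scheme: str,
    with_instruction: bool = True,
    with_contents: bool = True,
) -> str:
    """Format one multiple-choice item; only the labels change across schemes."""
    labels = make_labels(scheme, len(options))
    lines = []
    if with_instruction:
        # Assumed instruction wording; the abstract does not give the exact text.
        lines.append(f"Answer with only the option label ({', '.join(labels)}).")
    lines.append(question)
    for label, text in zip(labels, options):
        # With contents removed, only the bare label is shown (an assumption).
        lines.append(f"{label}. {text}" if with_contents else f"{label}.")
    lines.append("Answer:")
    return "\n".join(lines)


def random_baseline(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options labels."""
    return 1.0 / num_options


if __name__ == "__main__":
    for scheme in ("alphabetic", "numeric", "roman"):
        print(build_prompt("2 + 2 = ?", ["3", "4", "5", "6"], scheme))
        print()
```

Because the option texts and their order are identical across schemes, any accuracy difference between the three prompts would be attributable to the label format alone, which is the comparison the abstract reports.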
