The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
October 20, 2025
Authors: Henry Lim, Kwan Hui Lim
cs.AI
Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, and analyze performance under four paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail to beat random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
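
As a minimal illustrative sketch (not the authors' released code), the label-format manipulation described in the abstract could be implemented roughly as follows: the same MMLU-style item is rendered with alphabetic, numeric, or Roman labels while the option texts stay fixed, with flags corresponding to paradigms (1)-(3). The item structure, helper names, and instruction wording below are assumptions for illustration only.

```python
# Illustrative sketch of the label-relabeling setup described in the abstract.
# Helper names and prompt wording are hypothetical, not the paper's exact protocol.

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]

def make_labels(style: str, n: int) -> list[str]:
    """Return n option labels in the requested format (alphabetic, numeric, Roman)."""
    if style == "alphabetic":
        return [chr(ord("A") + i) for i in range(n)]
    if style == "numeric":
        return [str(i + 1) for i in range(n)]
    if style == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")

def build_prompt(question: str, options: list[str], style: str,
                 with_instruction: bool = True, contents_visible: bool = True) -> str:
    """Render one multiple-choice item with the chosen label format."""
    labels = make_labels(style, len(options))
    lines = [question]
    for label, text in zip(labels, options):
        # Paradigm (3) removes option contents, leaving only the labels.
        lines.append(f"{label}. {text}" if contents_visible else f"{label}.")
    if with_instruction:
        # Explicit atomic directive, as in paradigm (1); omitted in paradigm (2).
        lines.append(f"Answer with only the label ({', '.join(labels)}) of the correct option.")
    return "\n".join(lines)

# Example: the same item rendered with Roman labels and an explicit instruction.
print(build_prompt(
    "Which planet is known as the Red Planet?",
    ["Venus", "Mars", "Jupiter", "Mercury"],
    style="roman",
))
```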