The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
October 20, 2025
Authors: Henry Lim, Kwan Hui Lim
cs.AI
Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot
reasoning, yet their ability to execute simple, self-contained instructions
remains underexplored, despite this being foundational to complex
instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro
benchmarks, systematically varying the format of option labels (alphabetic,
numeric, Roman) while keeping their meaning identical, and test under four
paradigms: (1) With explicit instructions, label changes cause large
performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing
instruction-format bias. (2) Without instructions, performance drops further
(by up to 10.84%) and label sensitivity intensifies, underscoring the role of
explicit guidance. (3) When option contents are removed, models fail to beat
random-choice baselines except with numeric labels, suggesting weak adherence
to atomic directives. (4)
Three-shot exemplars yield no significant gains in robustness or fidelity, and
generation analyses show persistent label errors, especially for non-numeric
formats. Across model sizes, larger LLMs achieve higher accuracy but remain
inconsistent in instruction adherence. These results expose the insufficiencies
of current instruction-tuning paradigms and highlight the need for evaluation
methods and training strategies that explicitly target atomic
instruction-following.
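To make the evaluation setup concrete, below is a minimal sketch (not the authors' released code) of the label-variation paradigm: the same multiple-choice item is rendered with alphabetic, numeric, or Roman-numeral labels while the option texts stay identical. The instruction wording and the helper names (render_prompt, LABEL_SETS) are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of the label-variation paradigm (assumed names/wording).
# The same question and options are rendered under three label formats.

LABEL_SETS = {
    "alphabetic": ["A", "B", "C", "D"],
    "numeric":    ["1", "2", "3", "4"],
    "roman":      ["I", "II", "III", "IV"],
}

def render_prompt(question, options, label_format, with_instruction=True):
    """Render one multiple-choice item with the chosen label format."""
    labels = LABEL_SETS[label_format]
    lines = []
    if with_instruction:
        # Hypothetical phrasing; the paper's exact instruction text may differ.
        lines.append(f"Answer with only the label ({', '.join(labels)}) "
                     "of the correct option.")
    lines.append(question)
    for label, option in zip(labels, options):
        lines.append(f"{label}. {option}")
    return "\n".join(lines)

# Same item, three renderings; only the labels change, never the contents.
for fmt in LABEL_SETS:
    print(render_prompt(
        "Which planet is closest to the Sun?",
        ["Venus", "Mercury", "Earth", "Mars"],
        label_format=fmt,
    ), end="\n\n")
```

Dropping the options list from the rendered prompt gives paradigm (3), where a model that follows the atomic directive should still emit only a valid label and, in aggregate, match the random-choice baseline (25% for four options).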