The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
October 20, 2025
Authors: Henry Lim, Kwan Hui Lim
cs.AI
Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot
reasoning, yet their ability to execute simple, self-contained instructions
remains underexplored, despite this being foundational to complex
instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro
benchmarks, systematically varying the format of option labels (alphabetic,
numeric, Roman) while keeping their meaning identical, and test under four
paradigms: (1) With explicit instructions, label changes cause large
performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing
instruction-format bias. (2) Without instructions, performance drops further
(by up to 10.84%) and label sensitivity intensifies, underscoring the role of
explicit guidance. (3) When option contents are removed, models fail to beat
random-choice baselines except with numeric labels, suggesting weak adherence
to atomic directives. (4)
Three-shot exemplars yield no significant gains in robustness or fidelity, and
generation analyses show persistent label errors, especially for non-numeric
formats. Across model sizes, larger LLMs achieve higher accuracy but remain
inconsistent in instruction adherence. These results expose the insufficiencies
of current instruction-tuning paradigms and highlight the need for evaluation
methods and training strategies that explicitly target atomic
instruction-following.
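To make the evaluation setup concrete, below is a minimal sketch (not the authors' released code) of the label-variation paradigm: the same multiple-choice item is rendered with alphabetic, numeric, or Roman-numeral labels while the option texts stay identical. The instruction wording and the helper names (render_prompt, LABEL_SETS) are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of the label-variation paradigm (assumed names/wording).
# The same question and options are rendered under three label formats.

LABEL_SETS = {
    "alphabetic": ["A", "B", "C", "D"],
    "numeric":    ["1", "2", "3", "4"],
    "roman":      ["I", "II", "III", "IV"],
}

def render_prompt(question, options, label_format, with_instruction=True):
    """Render one multiple-choice item with the chosen label format."""
    labels = LABEL_SETS[label_format]
    lines = []
    if with_instruction:
        # Hypothetical phrasing; the paper's exact instruction text may differ.
        lines.append(f"Answer with only the label ({', '.join(labels)}) "
                     "of the correct option.")
    lines.append(question)
    for label, option in zip(labels, options):
        lines.append(f"{label}. {option}")
    return "\n".join(lines)

# Same item, three renderings; only the labels change, never the contents.
for fmt in LABEL_SETS:
    print(render_prompt(
        "Which planet is closest to the Sun?",
        ["Venus", "Mercury", "Earth", "Mars"],
        label_format=fmt,
    ), end="\n\n")
```

Dropping the options list from the rendered prompt gives paradigm (3), where a model that follows the atomic directive should still emit only a valid label and, in aggregate, match the random-choice baseline (25% for four options).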