The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
October 20, 2025
Authors: Henry Lim, Kwan Hui Lim
cs.AI
Abstract
Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, and analyze performance under four paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail to beat random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
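
As a minimal illustrative sketch (not the authors' released code), the label-format manipulation described in the abstract could be implemented roughly as follows: the same MMLU-style item is rendered with alphabetic, numeric, or Roman labels while the option texts stay fixed, with flags corresponding to paradigms (1)-(3). The item structure, helper names, and instruction wording below are assumptions for illustration only.

```python
# Illustrative sketch of the label-relabeling setup described in the abstract.
# Helper names and prompt wording are hypothetical, not the paper's exact protocol.

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]

def make_labels(style: str, n: int) -> list[str]:
    """Return n option labels in the requested format (alphabetic, numeric, Roman)."""
    if style == "alphabetic":
        return [chr(ord("A") + i) for i in range(n)]
    if style == "numeric":
        return [str(i + 1) for i in range(n)]
    if style == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")

def build_prompt(question: str, options: list[str], style: str,
                 with_instruction: bool = True, contents_visible: bool = True) -> str:
    """Render one multiple-choice item with the chosen label format."""
    labels = make_labels(style, len(options))
    lines = [question]
    for label, text in zip(labels, options):
        # Paradigm (3) removes option contents, leaving only the labels.
        lines.append(f"{label}. {text}" if contents_visible else f"{label}.")
    if with_instruction:
        # Explicit atomic directive, as in paradigm (1); omitted in paradigm (2).
        lines.append(f"Answer with only the label ({', '.join(labels)}) of the correct option.")
    return "\n".join(lines)

# Example: the same item rendered with Roman labels and an explicit instruction.
print(build_prompt(
    "Which planet is known as the Red Planet?",
    ["Venus", "Mars", "Jupiter", "Mercury"],
    style="roman",
))
```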