Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
June 2, 2025
Authors: Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
cs.AI
Abstract
Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in parallel, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to unpack the composition of constraints and identify their relationships across hierarchies of types and dimensions. To this end, we propose a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we start from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate a steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Code and data are available at https://github.com/yuleiqin/RAIF.
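For readers curious how "verifiable rule-centric reward signals" for instruction following are typically realized, the sketch below shows one possible form: each constraint in an instruction is compiled into a programmatic checker, and the reward is the fraction of checks a response passes. The constraint names (max_words, must_include, etc.) and the averaging rule are illustrative assumptions for exposition, not the paper's exact implementation; see the repository above for the authors' code.

```python
# Illustrative sketch only: a rule-centric, verifiable reward for
# instruction-following RL. Constraint names and the aggregation rule
# are assumptions, not the paper's implementation.
from typing import Callable, Dict, List

# Each rule maps a model response to True/False, so the reward is verifiable.
Rule = Callable[[str], bool]

def make_rules(spec: Dict) -> List[Rule]:
    """Build checkers from a constraint spec, e.g. length, keyword, format."""
    rules: List[Rule] = []
    if "max_words" in spec:
        rules.append(lambda r, n=spec["max_words"]: len(r.split()) <= n)
    if "must_include" in spec:
        rules.append(lambda r, kw=spec["must_include"]: kw.lower() in r.lower())
    if "forbidden_word" in spec:
        rules.append(lambda r, w=spec["forbidden_word"]: w.lower() not in r.lower())
    if spec.get("json_only"):
        rules.append(lambda r: r.strip().startswith("{") and r.strip().endswith("}"))
    return rules

def rule_centric_reward(response: str, spec: Dict) -> float:
    """Fraction of satisfied constraints; 1.0 only when every rule passes."""
    rules = make_rules(spec)
    if not rules:
        return 0.0
    passed = sum(rule(response) for rule in rules)
    return passed / len(rules)

# Usage example
spec = {"max_words": 50, "must_include": "deadline", "forbidden_word": "maybe"}
print(rule_centric_reward("The deadline is Friday.", spec))  # 1.0
```

Because every reward component is a deterministic check rather than a learned judge, such signals can be audited and reproduced, which is what makes them suitable for RL-based incentivization of reasoning.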