Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
August 4, 2025
Authors: Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
cs.AI
Abstract
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
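To make the core idea concrete, the sketch below shows one way a self-supervised reward could be wired into an RL loop: the policy samples several candidate responses per instruction-constrained prompt, scores its own outputs for constraint compliance, and normalizes the scores within each group (a GRPO-style advantage). This is an illustrative assumption, not the authors' implementation; the function names (`generate_candidates`, `self_score_compliance`) and the normalization scheme are hypothetical stand-ins, and the paper's actual internal signal and RL algorithm may differ.

```python
# Hypothetical sketch: self-supervised reward for instruction following.
# The policy model judges its own outputs, so no stronger external judge
# model is required. All helpers below are illustrative stand-ins.
import random
from statistics import mean, pstdev


def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n responses from the policy model."""
    return [f"response-{i} to: {prompt}" for i in range(n)]


def self_score_compliance(prompt: str, response: str) -> float:
    """Stand-in for the model's own internal signal, e.g. the probability
    it assigns to 'yes' when asked whether the response satisfies every
    constraint in the prompt. Here: a random placeholder in [0, 1)."""
    return random.random()


def group_normalized_rewards(prompt: str) -> list[tuple[str, float]]:
    """Score each candidate with the model's own signal, then normalize
    within the group (zero mean, unit variance), GRPO-style, to form
    per-candidate advantages for the policy-gradient update."""
    candidates = generate_candidates(prompt)
    scores = [self_score_compliance(prompt, c) for c in candidates]
    mu, sigma = mean(scores), pstdev(scores) or 1.0  # guard zero variance
    return [(c, (s - mu) / sigma) for c, s in zip(candidates, scores)]


if __name__ == "__main__":
    prompt = "Summarize the paper in exactly three bullet points."
    for resp, adv in group_normalized_rewards(prompt):
        print(f"advantage={adv:+.2f}  {resp}")
```

In a real training run, the group-normalized advantages would weight the log-probabilities of each sampled response in the policy update; the key point the abstract makes is that the scoring step uses the model's own signal rather than an external supervisor.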