Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
August 4, 2025
Authors: Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
cs.AI
Abstract
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
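To make the core idea concrete, the sketch below shows one way a self-supervised reward could be wired into an RL loop: the policy samples several candidate responses per instruction-constrained prompt, scores its own outputs for constraint compliance, and normalizes the scores within each group (a GRPO-style advantage). This is an illustrative assumption, not the authors' implementation; the function names (`generate_candidates`, `self_score_compliance`) and the normalization scheme are hypothetical stand-ins, and the paper's actual internal signal and RL algorithm may differ.

```python
# Hypothetical sketch: self-supervised reward for instruction following.
# The policy model judges its own outputs, so no stronger external judge
# model is required. All helpers below are illustrative stand-ins.
import random
from statistics import mean, pstdev


def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n responses from the policy model."""
    return [f"response-{i} to: {prompt}" for i in range(n)]


def self_score_compliance(prompt: str, response: str) -> float:
    """Stand-in for the model's own internal signal, e.g. the probability
    it assigns to 'yes' when asked whether the response satisfies every
    constraint in the prompt. Here: a random placeholder in [0, 1)."""
    return random.random()


def group_normalized_rewards(prompt: str) -> list[tuple[str, float]]:
    """Score each candidate with the model's own signal, then normalize
    within the group (zero mean, unit variance), GRPO-style, to form
    per-candidate advantages for the policy-gradient update."""
    candidates = generate_candidates(prompt)
    scores = [self_score_compliance(prompt, c) for c in candidates]
    mu, sigma = mean(scores), pstdev(scores) or 1.0  # guard zero variance
    return [(c, (s - mu) / sigma) for c, s in zip(candidates, scores)]


if __name__ == "__main__":
    prompt = "Summarize the paper in exactly three bullet points."
    for resp, adv in group_normalized_rewards(prompt):
        print(f"advantage={adv:+.2f}  {resp}")
```

In a real training run, the group-normalized advantages would weight the log-probabilities of each sampled response in the policy update; the key point the abstract makes is that the scoring step uses the model's own signal rather than an external supervisor.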