VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models and achieve significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. To facilitate future research, we have released our datasets, code, and models at https://github.com/THU-KEG/VerIF.
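
To make the idea of combining rule-based code verification with LLM-based verification concrete, the following is a minimal sketch and not the released VerIF implementation. It assumes that each constraint is either checkable by a Python predicate or must be judged by a model, and that the two verdicts are averaged into one reward; the averaging rule and the names `Constraint`, `hybrid_reward`, and `stub_judge` are all hypothetical. The judge is passed in as a plain callable so that the sketch runs without any model; in practice it would wrap a call to a reasoning model such as QwQ-32B.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Constraint:
    """One instruction constraint (hypothetical structure for illustration)."""
    description: str
    # Predicate for constraints that code can check; None means "ask the LLM judge".
    check: Optional[Callable[[str], bool]] = None


def hybrid_reward(
    response: str,
    constraints: List[Constraint],
    llm_judge: Callable[[str, str], bool],
) -> float:
    """Average the code-verification and LLM-verification verdicts (an assumption)."""
    code_checked = [c for c in constraints if c.check is not None]
    llm_checked = [c for c in constraints if c.check is None]

    scores = []
    if code_checked:
        # Rule-based side: every code-checkable constraint must pass.
        scores.append(float(all(c.check(response) for c in code_checked)))
    if llm_checked:
        # LLM side: the judge must accept every remaining constraint.
        scores.append(
            float(all(llm_judge(response, c.description) for c in llm_checked))
        )
    return sum(scores) / len(scores) if scores else 0.0


def stub_judge(response: str, constraint: str) -> bool:
    """Stand-in for a real LLM call; accepts responses that contain 'please'."""
    return "please" in response.lower()


constraints = [
    Constraint("Use at most five words.", check=lambda r: len(r.split()) <= 5),
    Constraint("Use a polite tone."),  # no predicate, so the judge decides
]

reward_both = hybrid_reward("Please sit down", constraints, stub_judge)
reward_code_only = hybrid_reward("Sit down now", constraints, stub_judge)
reward_none = hybrid_reward("Sit down right now or else leave", constraints, stub_judge)
```

Passing the judge as a parameter keeps the rule-based path deterministic and testable on its own, while the model-backed path can be swapped or cached independently.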