VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models and achieve significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. To facilitate future research, we release our datasets, code, and models at https://github.com/THU-KEG/VerIF.
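
As a rough illustration of what combining rule-based code verification with LLM-based verification could look like, the following Python sketch scores a response against both kinds of constraints. It is not the released VerIF implementation: the names `Instance`, `hybrid_reward`, and `stub_judge` are hypothetical, the judge is a stand-in for a call to a reasoning model, and averaging the verdicts is only one assumed way to aggregate them into a reward.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration; none of these names come from the VerIF release.
RuleCheck = Callable[[str], bool]          # hard constraint verified by code
JudgeFn = Callable[[str, str, str], bool]  # (instruction, response, constraint) -> verdict


@dataclass
class Instance:
    instruction: str
    rule_checks: List[RuleCheck]  # e.g. length or keyword constraints
    soft_constraints: List[str]   # e.g. tone or style constraints, left to an LLM judge


def hybrid_reward(instance: Instance, response: str, judge: JudgeFn) -> float:
    """Return the fraction of satisfied constraints (an assumed aggregation rule)."""
    verdicts = [check(response) for check in instance.rule_checks]
    verdicts += [
        judge(instance.instruction, response, constraint)
        for constraint in instance.soft_constraints
    ]
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def stub_judge(instruction: str, response: str, constraint: str) -> bool:
    """Placeholder for a reasoning-model call; uses a toy heuristic instead."""
    return "formal" not in constraint or "Dear" in response


example = Instance(
    instruction="Write a formal greeting in under 10 words that mentions 'team'.",
    rule_checks=[
        lambda r: len(r.split()) < 10,  # length constraint, checked by code
        lambda r: "team" in r.lower(),  # keyword constraint, checked by code
    ],
    soft_constraints=["Use a formal tone."],
)

print(hybrid_reward(example, "Dear team, welcome aboard.", stub_judge))
```

In this sketch, constraints that can be checked deterministically stay in code, and only constraints that need judgment go to the (stubbed) model. A real setup would replace `stub_judge` with an actual model call and might weight or gate the two signals differently.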