Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Abstract

Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, which creates methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised reinforcement learning (RL) framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
