Compositional Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently become a cornerstone for shaping the coding abilities of Large Language Models (LLMs). However, scaling RLVR is severely constrained by the scarcity of verifiable code tasks that are challenging enough to sit near the edge of the model's competence. Prior studies often rely on heuristic seed expansion for data synthesis, which limits both novelty and difficulty. Consequently, the training value of such data fails to scale in proportion to the amount synthesized. To address this, we propose Atomic Decomposition and Recombination (ADR), a framework that decomposes verifiable code tasks into atomic elements and generates new tasks through controlled recombination, yielding tasks that are both novel and challenging. Experiments and analysis show that ADR achieves higher originality, difficulty, diversity, and test quality than existing baselines, and consistently delivers larger gains in code ability when used for RLVR across diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work points to a new paradigm for code task synthesis and scalable RLVR training.
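
The decompose-then-recombine idea can be illustrated with a toy sketch. This is not the ADR pipeline, which the abstract does not specify. The `Task` fields, the functions `decompose` and `recombine`, and the seed tasks are all hypothetical names chosen for illustration, and "controlled" is reduced here to excluding seed combinations and capping the sample size.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical representation: a task is described by three atomic elements.
# The abstract does not say what ADR's atomic elements actually are.
@dataclass(frozen=True)
class Task:
    algorithm: str
    structure: str
    constraint: str

SLOTS = ("algorithm", "structure", "constraint")

def decompose(tasks):
    """Split seed tasks into one sorted pool of atomic elements per slot."""
    pools = {slot: set() for slot in SLOTS}
    for task in tasks:
        for slot in SLOTS:
            pools[slot].add(getattr(task, slot))
    return {slot: sorted(values) for slot, values in pools.items()}

def recombine(pools, seeds, k, rng):
    """Sample up to k element combinations that are not already seed tasks."""
    seen = set(seeds)
    candidates = [
        Task(a, s, c)
        for a, s, c in itertools.product(*(pools[slot] for slot in SLOTS))
        if Task(a, s, c) not in seen
    ]
    return rng.sample(candidates, min(k, len(candidates)))

# Usage: two seed tasks yield 2 x 2 x 2 = 8 combinations, 6 of them unseen.
seeds = [
    Task("binary search", "sorted array", "O(log n) time"),
    Task("dfs", "tree", "O(1) extra space"),
]
new_tasks = recombine(decompose(seeds), seeds, k=3, rng=random.Random(0))
```

In a realistic setting, each recombined candidate would still need to be turned into a full problem statement with tests and filtered for coherence and verifiability; the sketch only shows how recombination can reach combinations that no seed task contains.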