Compositional Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently become a core technique for developing the strong coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of verifiable code tasks that are sufficiently challenging and sit near the edge of the model's competence. Prior studies often rely on heuristic seed expansion for data synthesis, which limits both the novelty and the difficulty of the resulting tasks. Consequently, the training value of such data fails to grow in proportion to the volume of data synthesized. To address this, we propose Atomic Decomposition and Recombination (ADR), a framework that decomposes verifiable code tasks into atomic elements and recombines them in a controlled way, producing genuinely novel and challenging verifiable code tasks. Experiments and analysis show that ADR achieves better originality, difficulty, diversity, and test quality than existing baselines, and that RLVR training on its tasks consistently yields larger gains in coding ability across diverse downstream domains, including algorithmic programming, tool use, and data science. Our work points to a new paradigm for novel code task synthesis and scalable RLVR training.
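
The abstract does not specify how tasks are decomposed or how recombination is controlled, so the following is only a toy sketch of the general decompose-then-recombine idea, not the ADR implementation. It assumes, purely for illustration, that a task can be described by three hypothetical slots (`algorithm`, `data_structure`, `constraint`) and that "control" means nothing more than excluding combinations already present among the seeds. The names `Task`, `decompose`, `recombine`, and `SEEDS` are all invented here.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A toy task described by three hypothetical atomic slots (an assumption)."""
    algorithm: str
    data_structure: str
    constraint: str


def decompose(tasks):
    """Collect the distinct values seen in each slot across the seed tasks."""
    return {
        "algorithm": sorted({t.algorithm for t in tasks}),
        "data_structure": sorted({t.data_structure for t in tasks}),
        "constraint": sorted({t.constraint for t in tasks}),
    }


def recombine(atoms, seeds, k, rng):
    """Sample up to k slot combinations that do not appear among the seeds."""
    seen = set(seeds)
    candidates = [
        Task(a, d, c)
        for a, d, c in itertools.product(
            atoms["algorithm"], atoms["data_structure"], atoms["constraint"]
        )
        if Task(a, d, c) not in seen
    ]
    return rng.sample(candidates, min(k, len(candidates)))


# Illustrative seed tasks; these are made up for the sketch.
SEEDS = [
    Task("binary search", "sorted array", "O(log n) time"),
    Task("breadth-first search", "grid graph", "at most 10^5 cells"),
    Task("dynamic programming", "string", "O(n^2) memory"),
]

if __name__ == "__main__":
    atoms = decompose(SEEDS)
    for task in recombine(atoms, SEEDS, k=3, rng=random.Random(0)):
        print(task)
```

Naive recombination like this can yield combinations that make no sense as tasks, and it produces no tests. A real pipeline would still need to filter candidates, turn each one into a concrete problem statement, and attach a verifier, which this sketch leaves out entirely.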