Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

June 10, 2026
Authors: Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
cs.AI

Abstract

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research shows that scaling the number of environments improves RL performance, existing methods build environments manually or one at a time and therefore hit a linear scaling bottleneck, which hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines four composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, on six benchmarks that were unseen during the construction of the training environments, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1 (+2.3). Moreover, using only 50 base environments, RACES achieves performance comparable to training on 300 individual environments, demonstrating efficient environment utilization.
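
The type-matching idea in the abstract can be illustrated with a minimal Python sketch. It is not the RACES implementation: the `Env` class, the `sequential` function, and the three toy environments are hypothetical names chosen for illustration, and the sketch assumes that each environment carries a reference solver whose output serves as the ground truth for verification. Only a sequential-style composition is shown, since the abstract does not spell out the semantics of the other operators.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task plus a reference solver."""
    name: str
    domain: type                  # input type
    codomain: type                # output type
    solve: Callable[[Any], Any]   # reference solution (assumed to exist)

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is accepted when it equals the reference solution.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's codomain matches second's domain."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} -> {second.name}")
    return Env(
        name=f"{second.name}({first.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


# Three toy base environments (illustrative only).
sort_list = Env("sort_list", list, list, sorted)
list_sum = Env("list_sum", list, int, sum)
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)

# The composite is itself an Env, so it can be composed again (recursion).
sorted_sum = sequential(sort_list, list_sum)
pipeline = sequential(sorted_sum, is_even)
```

Because `sequential` returns an ordinary `Env`, the result stays verifiable and can be fed back into further compositions, which is the property that lets a small pool of base environments yield many composite ones. A mismatched pair such as `sequential(is_even, sort_list)` is rejected with a `TypeError` in this sketch.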