Verifiable Environments as LEGO Bricks: Recursive Composition for Reasoning Generalization

Abstract

Reinforcement Learning (RL) with verifiable environments has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Prior research shows that scaling the number of environments improves RL performance, but existing methods build environments manually or one task at a time, so the environment pool grows only linearly with construction effort, which hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be automatically fused into a new verifiable environment, and the fused environment can itself be composed again. RACES is implemented on top of 300 individual environments and defines four composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently improves reasoning generalization. Specifically, on six benchmarks that were not seen during construction of the training environments, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1. Moreover, with only 50 base environments, RACES reaches performance comparable to training on all 300 individual environments, demonstrating efficient use of environments.
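The type-matching rule behind recursive composition can be illustrated with a minimal Python sketch. This is not the RACES implementation: the `Env` class, the `sequential` function, and the three toy environments are hypothetical names introduced here, and the sketch assumes that an environment can be modeled as a typed function with a reference solution, and that SEQUENTIAL feeds the output of one environment into the next.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment, modeled as a typed function."""
    name: str
    domain: type                    # input type
    codomain: type                  # output type
    solve: Callable[[Any], Any]     # reference solution used for checking

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer counts as correct when it equals the reference output.
        return answer == self.solve(x)


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's codomain matches second's domain.

    Assumed semantics for illustration; the abstract does not define them.
    """
    if first.codomain is not second.domain:
        raise TypeError(
            f"cannot compose {first.name} -> {second.name}: "
            f"{first.codomain.__name__} != {second.domain.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


# Three toy base environments (illustrative only, not from the paper).
dedupe = Env("dedupe", list, list, lambda xs: sorted(set(xs)))
list_sum = Env("list_sum", list, int, lambda xs: sum(xs))
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)

# A composite is itself an Env, so it can be composed again (recursion).
sum_unique = sequential(dedupe, list_sum)    # list -> int
parity = sequential(sum_unique, is_even)     # list -> bool

print(parity.name)
print(parity.verify([3, 1, 2, 3], True))     # unique values sum to 6, which is even
```

Because `sequential` returns an ordinary `Env`, the composite inherits a verifier for free, which is what makes the new environment usable for RL without writing a checker by hand. Incompatible pairs, such as `sequential(is_even, dedupe)`, are rejected with a `TypeError` in this sketch.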