Verifiable Environments Are LEGO Blocks: Recursive Composition for Reasoning Generalization

Abstract

Reinforcement learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of large language models (LLMs). Prior work shows that scaling the number of environments improves RL performance, but existing methods construct environments manually or one at a time, which limits them to linear scaling and hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be assembled recursively. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be fused automatically into a new verifiable environment, and the result can itself be composed again. RACES is implemented with 300 individual environments and defines four composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently improves reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1 (a 2.3-point gain) on six benchmarks, none of which were seen during the construction of the training environments. Moreover, using only 50 base environments, RACES achieves performance comparable to training on 300 individual environments, demonstrating significant efficiency in environment utilization.
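The type-matching rule described in the abstract can be illustrated with a minimal Python sketch. This is not the RACES implementation: the `Env` class, the `sequential` and `parallel` helpers, and the three toy environments are hypothetical names introduced here for illustration, and the operator semantics are an assumed reading of SEQUENTIAL and PARALLEL. SORT and SELECT are left out because the abstract does not define them.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Toy verifiable environment: a typed task with a reference solver (hypothetical)."""
    name: str
    domain: type                 # input type
    codomain: type               # output type
    solve: Callable[[Any], Any]  # reference solution, used only for checking

    def verify(self, x: Any, answer: Any) -> bool:
        """Return True if `answer` equals the reference output for input `x`."""
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first.codomain matches second.domain."""
    if first.codomain is not second.domain:
        raise TypeError(
            f"cannot compose {first.name} -> {second.name}: "
            f"{first.codomain.__name__} != {second.domain.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


def parallel(left: Env, right: Env) -> Env:
    """Pair two environments; input and output are both 2-tuples (assumed semantics)."""
    return Env(
        name=f"PARALLEL({left.name}, {right.name})",
        domain=tuple,
        codomain=tuple,
        solve=lambda xs: (left.solve(xs[0]), right.solve(xs[1])),
    )


# Three toy base environments (illustrative only).
count_vowels = Env("count_vowels", str, int, lambda s: sum(c in "aeiou" for c in s))
square = Env("square", int, int, lambda n: n * n)
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)

# A composite is itself an Env, so it can be composed again (recursion).
vowels_squared = sequential(count_vowels, square)           # str -> int
vowels_squared_even = sequential(vowels_squared, is_even)   # str -> bool

print(vowels_squared_even.name)
print(vowels_squared_even.verify("banana", False))
```

Because a composite has its own domain and codomain, it can be fed back into `sequential` or `parallel` like any base environment, which is what makes the construction recursive. Attempting `sequential(is_even, count_vowels)` raises `TypeError` in this sketch, since `bool` does not match `str`.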