Verifiable Environments Are LEGO Blocks: Recursive Composition for Reasoning Generalization

Abstract

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Prior research shows that increasing the number of environments improves RL performance, but existing methods, which construct environments manually or one at a time, scale only linearly and thereby hinder scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be assembled recursively. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be fused automatically into a new verifiable environment, and the result can itself be composed again. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently improves reasoning generalization. Across six benchmarks, none of which were seen during the construction of the training environments, RACES improves the average score of DeepSeek-R1-Distill-Qwen-14B by 3.1 points (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1. Moreover, using only 50 base environments, RACES achieves performance comparable to training on 300 individual environments, demonstrating a substantial gain in the efficiency of environment utilization.
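The type-matching idea behind composition can be illustrated with a minimal Python sketch. It is not the RACES implementation: the `Env` class, the string type tags, the `sequential` helper, and the three toy environments are all hypothetical, and only a SEQUENTIAL-style operator is shown, under the assumption that a composed environment is verified by chaining the reference solvers of its parts.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """A toy verifiable environment: a typed task with a reference solver."""
    name: str
    domain: str                  # input type tag (hypothetical representation)
    codomain: str                # output type tag
    solve: Callable[[Any], Any]  # reference solution used as ground truth

    def verify(self, x: Any, answer: Any) -> bool:
        # An answer is correct when it matches the reference solution.
        return answer == self.solve(x)


def sequential(a: Env, b: Env) -> Env:
    """Fuse a and b when a's codomain matches b's domain (assumed semantics)."""
    if a.codomain != b.domain:
        raise TypeError(
            f"cannot compose {a.name} -> {b.name}: {a.codomain} != {b.domain}"
        )
    return Env(
        name=f"SEQUENTIAL({a.name}, {b.name})",
        domain=a.domain,
        codomain=b.codomain,
        solve=lambda x: b.solve(a.solve(x)),
    )


# Three hypothetical base environments.
sort_list = Env("sort_list", "list[int]", "list[int]", lambda xs: sorted(xs))
sum_list = Env("sum_list", "list[int]", "int", lambda xs: sum(xs))
is_even = Env("is_even", "int", "bool", lambda n: n % 2 == 0)

# A composition is itself an Env, so it can be composed again.
sort_then_sum = sequential(sort_list, sum_list)
pipeline = sequential(sort_then_sum, is_even)
```

Because `sequential` returns an ordinary `Env`, a composed environment can be fed back into `sequential`, which is what makes the construction recursive in this sketch. Pairs whose types do not line up, such as `is_even` followed by `sort_list`, are rejected before any task is generated.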