Learning to Construct Environments: Self-Evolving Reasoning Reinforcement Learning via Verifiable Environment Synthesis

Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes reference solutions, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write an oracle once, yet be unable to reliably execute that oracle in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, such as planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps the reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, RLVR on fixed public data and RLVR on fixed hand-crafted environments both reduce average performance, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
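To make the notion of an environment as a reusable executable object concrete, the following is a minimal sketch of a planted subset-sum environment, one of the hard-to-solve, easy-to-verify tasks named above. It is illustrative only and is not EvoEnv's actual interface: the class `PlantedSubsetSumEnv`, the methods `sample` and `score`, the helpers `pass_rate` and `in_difficulty_band`, the space-separated index response format, and the 0.1 to 0.9 pass-rate band are all assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    numbers: tuple[int, ...]
    target: int


class PlantedSubsetSumEnv:
    """Hypothetical environment: samples instances, keeps a reference, scores responses."""

    def __init__(self, n: int = 12, max_value: int = 10_000) -> None:
        self.n = n
        self.max_value = max_value

    def sample(self, rng: random.Random) -> tuple[Instance, tuple[int, ...]]:
        """Draw random numbers, then plant a subset so a solution is known to exist."""
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.n))
        k = rng.randint(2, self.n - 1)
        planted = tuple(sorted(rng.sample(range(self.n), k)))
        target = sum(numbers[i] for i in planted)
        return Instance(numbers, target), planted  # planted indices act as the reference

    def score(self, instance: Instance, response: str) -> float:
        """Verify a response of space-separated indices; any valid subset earns 1.0."""
        try:
            indices = [int(tok) for tok in response.split()]
        except ValueError:
            return 0.0
        if not indices or len(set(indices)) != len(indices):
            return 0.0
        if any(i < 0 or i >= len(instance.numbers) for i in indices):
            return 0.0
        total = sum(instance.numbers[i] for i in indices)
        return 1.0 if total == instance.target else 0.0


def pass_rate(env, solver, rng: random.Random, trials: int = 50) -> float:
    """Estimate how often a solver (a callable Instance -> str) succeeds on fresh instances."""
    hits = 0.0
    for _ in range(trials):
        instance, _ = env.sample(rng)
        hits += env.score(instance, solver(instance))
    return hits / trials


def in_difficulty_band(rate: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Assumed calibration rule: keep environments that are neither trivial nor hopeless."""
    return low <= rate <= high
```

Verification here costs a single sum over the proposed indices, while finding a valid subset without the planted reference generally requires search. That gap is the kind of solve-verify asymmetry the abstract describes, and a pass-rate check of this sort is one plausible way to calibrate difficulty relative to the current solver.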