Learning to Construct Environments: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit a stable solve-verify asymmetry. That is, the model must be able to write, once, an oracle that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, such as planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps the reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR both reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
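To make the notion of a reusable executable environment concrete, the following is a minimal sketch of a planted subset-sum environment that samples instances, records a reference, and scores responses. It is an illustration only and not EvoEnv's actual interface: the names `Instance`, `PlantedSubsetSumEnv`, `sample`, and `score`, as well as the default sizes, are assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    numbers: tuple[int, ...]  # the multiset shown to the solver
    target: int               # the sum the solver must hit


class PlantedSubsetSumEnv:
    """Hypothetical environment: hard to solve by search, cheap to verify."""

    def __init__(self, size: int = 12, max_value: int = 10**6):
        # Assumption: size >= 3 so that a planted subset of 2..size-1 items exists.
        self.size = size
        self.max_value = max_value

    def sample(self, seed: int) -> tuple[Instance, tuple[int, ...]]:
        """Return an instance and the planted index set that serves as reference."""
        rng = random.Random(seed)  # seeded, so every instance is reproducible
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.size))
        k = rng.randint(2, self.size - 1)
        planted = tuple(sorted(rng.sample(range(self.size), k)))
        target = sum(numbers[i] for i in planted)
        return Instance(numbers, target), planted

    def score(self, instance: Instance, response: list[int]) -> float:
        """Verify a proposed index list; any valid subset earns full reward."""
        if not response or len(set(response)) != len(response):
            return 0.0  # empty answers and repeated indices are rejected
        n = len(instance.numbers)
        if any(not isinstance(i, int) or i < 0 or i >= n for i in response):
            return 0.0  # out-of-range or non-integer indices are rejected
        total = sum(instance.numbers[i] for i in response)
        return 1.0 if total == instance.target else 0.0
```

Calling `env.sample(seed)` with different seeds yields as many fresh instances as needed, while `env.score` checks a response in linear time without reference to how the response was found. This is the easy-to-verify half of the asymmetry; the sketch does not attempt the validation, difficulty calibration, or novelty checks described above.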