SEAL: Synergistic Co-Evolution of Agents and Learning Environments

Abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier shifts during training, while the environment that provides supervision remains static or is only weakly coupled to the failures the agent reveals. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts by assigning turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Experiments on in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields average gains of +8.25 to +26.25 points across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
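The abstract names diagnosis-guided advantage reweighting but gives no formula for it. The sketch below illustrates one plausible reading, in which each turn's advantage is multiplied by a weight chosen from its diagnosed failure label. The label names, the weight values, the `Turn` type, and the `reweight_advantages` function are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical failure labels and weights (assumptions for illustration only;
# the abstract specifies neither the label set nor the weighting scheme).
FAILURE_WEIGHTS: Dict[str, float] = {
    "wrong_tool": 2.0,
    "bad_arguments": 1.5,
    "no_recovery": 1.5,
}


@dataclass
class Turn:
    advantage: float                     # per-turn advantage from any estimator
    failure_label: Optional[str] = None  # None: no failure diagnosed at this turn


def reweight_advantages(
    turns: List[Turn],
    weights: Dict[str, float] = FAILURE_WEIGHTS,
) -> List[float]:
    """Scale each turn's advantage by the weight of its diagnosed failure label.

    Turns without a label, or with a label missing from `weights`, keep
    their original advantage (weight 1.0).
    """
    reweighted = []
    for turn in turns:
        if turn.failure_label is None:
            weight = 1.0
        else:
            weight = weights.get(turn.failure_label, 1.0)
        reweighted.append(turn.advantage * weight)
    return reweighted


# One failed rollout with two diagnosed turns.
rollout = [
    Turn(0.5),
    Turn(-0.5, "wrong_tool"),
    Turn(-0.25, "no_recovery"),
]
print(reweight_advantages(rollout))  # [0.5, -1.0, -0.375]
```

Under this assumption, turns diagnosed as the cause of a failure contribute a larger share of the policy gradient, while undiagnosed turns are left unchanged. The actual method may weight turns differently.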