SEAL: Co-Evolution of Agents and Learning Environments

Abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier shifts during training, while the environment that provides supervision remains static or is only weakly coupled to the failures the agent reveals. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. On the environment side, the training-time learning interface evolves by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback. On the policy side, updates use diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields average gains of +8.25 to +26.25 points across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
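The shared-signal idea can be illustrated with a toy sketch. The abstract does not specify the label set, the weighting rule, or how the environment consumes the diagnoses, so everything below is an assumption made for illustration: the `Turn` type, the label names, the weight table, and both helper functions are hypothetical. The sketch only shows how one set of turn-level failure labels could feed both a policy-side reweighting step and an environment-side summary of which failures to address.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One turn of a rollout (hypothetical structure, not from the paper)."""
    advantage: float                      # advantage estimate for this turn
    failure_label: Optional[str] = None   # None means no failure was diagnosed


# Hypothetical label -> weight table; the real scheme is not given in the abstract.
LABEL_WEIGHTS: Dict[str, float] = {
    "wrong_tool": 2.0,
    "constraint_violation": 1.5,
}


def reweight_advantages(
    turns: List[Turn],
    label_weights: Dict[str, float],
    default_weight: float = 1.0,
) -> List[float]:
    """Policy side: scale each turn's advantage by a weight tied to its diagnosis."""
    return [
        t.advantage * label_weights.get(t.failure_label, default_weight)
        for t in turns
    ]


def failure_counts(turns: List[Turn]) -> Counter:
    """Environment side: tally diagnosed failures, e.g. to decide which
    hints or constraint information the training interface should expose."""
    return Counter(t.failure_label for t in turns if t.failure_label is not None)


rollout = [
    Turn(advantage=0.5),
    Turn(advantage=-1.0, failure_label="wrong_tool"),
    Turn(advantage=-0.5, failure_label="constraint_violation"),
]
weighted = reweight_advantages(rollout, LABEL_WEIGHTS)
counts = failure_counts(rollout)
```

Here the same `failure_label` field drives both outputs, which is the coupling the abstract describes; how SEAL actually computes weights or edits the environment is left to the full paper.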