SEAL: Co-Evolution of Agents and Learning Environments

Abstract

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier shifts during training, while the environment that provides supervision remains static or is only weakly coupled to the failures the agent reveals. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments on in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it raises the average score by 8.25 to 26.25 points across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.
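
The abstract does not specify how diagnosis-guided advantage reweighting is computed, so the following Python sketch is only one plausible reading, not the authors' formulation. It assumes a group-normalized outcome advantage per rollout (a common choice in policy optimization, not confirmed by the source) and scales that advantage per turn with a weight looked up from the turn's failure label. The label names, the weights in `LABEL_WEIGHTS`, and the `Rollout` type are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical failure labels and weights; the abstract lists neither.
LABEL_WEIGHTS: Dict[str, float] = {
    "ok": 1.0,
    "wrong_tool": 1.5,
    "constraint_violation": 1.3,
    "no_recovery": 1.2,
}


@dataclass
class Rollout:
    reward: float           # outcome of executable verification (1.0 pass, 0.0 fail)
    turn_labels: List[str]  # one diagnosis label per turn


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within a group of rollouts (assumed baseline scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def reweighted_turn_advantages(rollouts: List[Rollout]) -> List[List[float]]:
    """Scale each rollout's advantage per turn by its diagnosis weight."""
    base = group_advantages([r.reward for r in rollouts])
    return [
        [adv * LABEL_WEIGHTS.get(label, 1.0) for label in r.turn_labels]
        for r, adv in zip(rollouts, base)
    ]


if __name__ == "__main__":
    group = [
        Rollout(reward=1.0, turn_labels=["ok", "ok"]),
        Rollout(reward=0.0, turn_labels=["ok", "wrong_tool"]),
    ]
    print(reweighted_turn_advantages(group))
```

In this sketch, the turn diagnosed as `wrong_tool` in the failed rollout receives a larger negative advantage than its neighboring turn, which concentrates the penalty on the step the diagnosis blames. The same labels could also drive the environment-side updates described above, but that part is not modeled here.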