Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation
March 13, 2026
Authors: Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Tingyu Wu, Chenglong Li, Vireo Zhang, Kun Wang
cs.AI
Abstract
Open-world embodied agents must solve long-horizon tasks in which the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method proceeds in three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis result, and post-state) and organizes it into a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization, enabling efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals that go beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. In Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularity. Finally, in Knowledge-Driven Closed-Loop Control, the retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, yielding continual evolution without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU show consistent improvements over static-retrieval baselines.
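The abstract gives no implementation details, but the core machinery it describes, anchoring each attempt as a fixed-schema experience tuple, distilling failures into executable guardrails, and checking those guardrails before the planner acts, can be sketched roughly as follows. All names here (`ExperienceTuple`, `Guardrail`, `ExperienceStore`, and the tag-based trigger logic) are hypothetical illustrations, not the paper's actual API; the real system uses richer multi-dimensional indices and LLM-driven distillation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExperienceTuple:
    """One subgoal attempt, anchored with the fixed schema from the abstract."""
    pre_state: dict          # observable state before the action
    action: str              # subgoal-level action taken
    diagnosis: dict          # compositional signals: success flag, cause, indicators
    post_state: dict         # observable state afterwards
    tags: frozenset = field(default_factory=frozenset)  # semantic-tag index

@dataclass
class Guardrail:
    """Executable rule distilled from a failure: forbid an action in matching contexts."""
    forbidden_action: str
    trigger_tags: frozenset  # fire only when all trigger tags are present
    reason: str

class ExperienceStore:
    """Minimal tag-indexed store standing in for the three-tier experience space."""
    def __init__(self):
        self.tuples = []
        self.guardrails = []

    def anchor(self, exp: ExperienceTuple):
        self.tuples.append(exp)
        # Failure track of the dual-track distillation, crudely: a failure with
        # an enumerated cause becomes a guardrail against repeating the same
        # action in the same semantic context.
        if exp.diagnosis.get("success") is False and "cause" in exp.diagnosis:
            self.guardrails.append(Guardrail(
                forbidden_action=exp.action,
                trigger_tags=exp.tags,
                reason=exp.diagnosis["cause"],
            ))

    def allowed(self, action: str, context_tags):
        """Closed-loop check a planner would run before emitting an action."""
        for g in self.guardrails:
            if g.forbidden_action == action and g.trigger_tags <= set(context_tags):
                return False, g.reason
        return True, None
```

A usage sketch: after anchoring a failed attempt tagged `{"cave", "night"}` with cause "mob ambush", `store.allowed("mine_without_torch", {"cave", "night"})` returns `(False, "mob ambush")`, while the same action in an unrelated context stays permitted. In the full framework this veto would surface as a constraint injected into the LLM planner's prompt rather than a hard boolean.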