Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Abstract

Open-world embodied agents must solve long-horizon tasks, where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method has three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis result, and post-state) and organizes these tuples in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization, enabling efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. In Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. In Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.
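
The fixed-schema experience tuple and the dual-track distillation described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every class, field, and function name here (Diagnosis, Experience, Skill, Guardrail, distill, build_planner_context), the failure-cause values, and the inventory-count state representation are hypothetical, and the three-tier experience space, indexing, and rolling summarization are omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class FailureCause(Enum):
    # Hypothetical values; the abstract only says failure causes are enumerated.
    NONE = "none"
    MISSING_PRECONDITION = "missing_precondition"
    STAGNATION = "stagnation"


@dataclass
class Diagnosis:
    success: bool
    cause: FailureCause = FailureCause.NONE
    state_diff: str = ""         # state-difference summary
    progress: float = 0.0        # one continuous indicator, assumed in [0, 1]
    loop_detected: bool = False  # stagnation/loop detection flag


@dataclass
class Experience:
    # Fixed schema: pre-state, action, diagnosis result, post-state.
    subgoal: str
    pre_state: Dict[str, int]    # assumed representation: item -> count
    action: str
    diagnosis: Diagnosis
    post_state: Dict[str, int]


@dataclass
class Skill:
    subgoal: str
    preconditions: Dict[str, int]
    action: str
    verification: Dict[str, int]  # items expected to have increased


@dataclass
class Guardrail:
    subgoal: str
    forbidden_action: str
    root_cause: FailureCause


def distill(experiences: List[Experience]) -> Tuple[List[Skill], List[Guardrail]]:
    """Dual-track distillation: successes become skills, failures become guardrails."""
    skills: List[Skill] = []
    guardrails: List[Guardrail] = []
    for exp in experiences:
        if exp.diagnosis.success:
            # Verification criterion (assumed): items whose count increased.
            gained = {k: v for k, v in exp.post_state.items()
                      if v > exp.pre_state.get(k, 0)}
            skills.append(Skill(exp.subgoal, dict(exp.pre_state), exp.action, gained))
        else:
            guardrails.append(Guardrail(exp.subgoal, exp.action, exp.diagnosis.cause))
    return skills, guardrails


def build_planner_context(subgoal: str, skills: List[Skill],
                          guardrails: List[Guardrail]) -> str:
    """Assemble retrieved knowledge as text that could be injected into a planner prompt."""
    lines = [f"Subgoal: {subgoal}"]
    for s in skills:
        if s.subgoal == subgoal:
            lines.append(f"SKILL: {s.action} | requires {s.preconditions} | verify {s.verification}")
    for g in guardrails:
        if g.subgoal == subgoal:
            lines.append(f"AVOID: {g.forbidden_action} | cause {g.root_cause.value}")
    return "\n".join(lines)
```

In this sketch no model parameters change: new experiences only alter the skill and guardrail lists that the planner context is built from, which is the sense in which the loop is non-parametric.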
