XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
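The accumulate-then-retrieve loop described in the abstract can be illustrated with a toy sketch. This is not the XSkill implementation: the names `Entry`, `KnowledgeBank`, `accumulate`, `retrieve`, and `overlap` are hypothetical, a plain text description stands in for the visual observation, and word overlap stands in for whatever retriever the framework actually uses. It only shows the general shape of two knowledge streams with usage counts fed back into the store.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    kind: str       # "experience" (action-level tip) or "skill" (task-level plan)
    context: str    # ASSUMPTION: text description standing in for a visual observation
    guidance: str   # the reusable advice itself
    uses: int = 0   # usage count, recorded so accumulation can see what was retrieved


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a toy stand-in for a real retriever."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


@dataclass
class KnowledgeBank:
    """Hypothetical store holding both streams in one list."""
    entries: list[Entry] = field(default_factory=list)

    def accumulate(self, kind: str, context: str, guidance: str) -> Entry:
        """Add an entry, reusing an existing one if the same guidance is already stored."""
        for e in self.entries:
            if e.kind == kind and e.guidance == guidance:
                return e
        e = Entry(kind, context, guidance)
        self.entries.append(e)
        return e

    def retrieve(self, kind: str, context: str, k: int = 1) -> list[Entry]:
        """Return the k entries of one stream whose stored context best matches the query."""
        candidates = [e for e in self.entries if e.kind == kind]
        ranked = sorted(candidates, key=lambda e: overlap(e.context, context), reverse=True)
        top = ranked[:k]
        for e in top:
            e.uses += 1  # usage history that a later accumulation pass could consult
        return top


bank = KnowledgeBank()
bank.accumulate("experience", "bar chart with small axis labels", "Zoom in before reading values.")
bank.accumulate("experience", "street photo with a sign", "Crop the sign before running OCR.")
bank.accumulate("skill", "bar chart comparison question", "Locate the axes, read each bar, then compare.")

hit = bank.retrieve("experience", "bar chart with tiny axis labels")[0]
print(hit.guidance)
```

Keeping both streams in one store with a `kind` field is only a convenience for this sketch; the abstract says nothing about how the two streams are stored.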
