
XSkill: Continual Learning from Experience and Skills in Multimodal Agents

March 12, 2026
Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung
cs.AI

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
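The paper itself is not open for inspection here, but the loop the abstract describes, accumulating experiences and skills from rollouts, retrieving both streams by visual context at inference, and feeding usage history back into accumulation, can be illustrated with a minimal sketch. All class and method names below (`XSkillStore`, `accumulate`, `retrieve`) are hypothetical; string keys stand in for visual-context embeddings, and the distillation/critique steps are reduced to simple lookups.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeItem:
    key: str       # stands in for a visual-context embedding
    guidance: str  # distilled experience or skill text
    uses: int = 0  # usage history fed back into accumulation

class XSkillStore:
    """Toy dual-stream store: action-level experiences and task-level skills."""

    def __init__(self):
        self.experiences = {}  # visual context -> KnowledgeItem
        self.skills = {}       # visual context -> KnowledgeItem

    def accumulate(self, rollouts):
        # Distill each rollout into both knowledge streams, keyed by
        # the visual context it was observed in.
        for ctx, (exp_note, skill_note) in rollouts.items():
            self.experiences[ctx] = KnowledgeItem(ctx, exp_note)
            self.skills[ctx] = KnowledgeItem(ctx, skill_note)

    def retrieve(self, ctx):
        # Look up both streams for the current visual context and
        # log usage, closing the continual-learning loop.
        hits = []
        for store in (self.experiences, self.skills):
            item = store.get(ctx)
            if item is not None:
                item.uses += 1
                hits.append(item.guidance)
        return hits
```

A retrieval miss simply returns an empty list, so the agent falls back to plain tool use; in the paper's framing, the `uses` counter is where consolidation would later prune or reinforce knowledge.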