XSkill: Continual Learning from Experience and Skills in Multimodal Agents
March 12, 2026
Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung
cs.AI
Abstract
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
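The dual-stream loop described above (accumulate knowledge from rollouts, retrieve it by visual context at inference, and feed usage back into accumulation) can be sketched in minimal form. This is an illustrative toy, not the paper's implementation: the dictionary fields, exact-match retrieval, and success-only filtering are all stand-in assumptions for the visually grounded summarization, cross-rollout critique, and retrieval described in the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Hypothetical sketch of XSkill's two knowledge streams."""
    experiences: list = field(default_factory=list)  # action-level guidance
    skills: list = field(default_factory=list)       # task-level guidance

    def accumulate(self, rollouts):
        # Distill knowledge from multi-path rollouts; a stand-in for
        # visually grounded summarization and cross-rollout critique.
        for r in rollouts:
            if r["success"]:
                self.experiences.append({"context": r["visual_context"],
                                         "tip": r["action_summary"],
                                         "uses": 0})
                self.skills.append({"context": r["visual_context"],
                                    "plan": r["plan_outline"],
                                    "uses": 0})

    def retrieve(self, visual_context):
        # Retrieve knowledge matching the current visual context (here a
        # naive exact match) and record usage, which feeds back into
        # accumulation to close the continual-learning loop.
        exp = [e for e in self.experiences if e["context"] == visual_context]
        skl = [s for s in self.skills if s["context"] == visual_context]
        for item in exp + skl:
            item["uses"] += 1
        return exp, skl
```

A real system would replace the exact-match lookup with similarity search over visual embeddings and would prune or consolidate entries based on the recorded usage counts, but the control flow (accumulate, retrieve, feed back) is the same.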