MMSkills: Towards Multimodal Skills for General Visual Agents

Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
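As a rough illustration of the package structure described above (a textual procedure coupled with runtime state cards and multi-view keyframes), the following is a minimal sketch. It assumes a simple in-memory layout; every class name, field name, and the keyword-based `select_cards` helper is hypothetical and not taken from the MMSkills framework itself.

```python
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    """One reference screenshot for a state (hypothetical fields)."""
    image_path: str
    view: str  # e.g. "full_screen" or "cropped_region"


@dataclass
class StateCard:
    """One runtime state plus the visual cues that identify it (assumed layout)."""
    name: str
    visual_cues: list[str]
    next_action_hint: str
    keyframes: list[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure coupled with state cards and their keyframes."""
    name: str
    procedure: list[str]
    state_cards: list[StateCard] = field(default_factory=list)

    def select_cards(self, query: str, limit: int = 2) -> list[StateCard]:
        """Return at most `limit` cards whose name or cues contain `query`.

        Capping the result is a stand-in for keeping image context small;
        a real selector would compare against the live screenshot.
        """
        q = query.lower()
        hits = [
            card for card in self.state_cards
            if q in card.name.lower()
            or any(q in cue.lower() for cue in card.visual_cues)
        ]
        return hits[:limit]


# Hypothetical example skill for a GUI task.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the dialog"],
    state_cards=[
        StateCard(
            name="export_dialog_open",
            visual_cues=["dialog titled Export", "format dropdown visible"],
            next_action_hint="Select PDF and press Save",
            keyframes=[Keyframe("frames/export_dialog.png", "full_screen")],
        ),
        StateCard(
            name="export_failed",
            visual_cues=["red error banner"],
            next_action_hint="Dismiss the banner and retry",
        ),
    ],
)

selected = skill.select_cards("dialog")
print([card.name for card in selected])
```

In this sketch only the selected cards and their keyframes would be handed to a temporary branch for inspection, which is one way to read the branch-loading idea; the actual selection and alignment mechanism is not specified in the abstract.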