MMSkills: Towards Multimodal Skills for General Visual Agents
May 14, 2026
Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu
cs.AI
Abstract
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
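As a rough illustration of the package described above (a textual procedure coupled with runtime state cards and multi-view keyframes), the following minimal Python sketch shows one way such a structure might be laid out. The abstract does not specify a schema, so every class name, field name, and the keyword-based `select_cards` helper here is a hypothetical assumption rather than the authors' format or selection method.

```python
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    """One reference screenshot (hypothetical fields)."""
    image_path: str  # where the screenshot is stored
    view: str        # which view of the state this frame shows, e.g. "full"


@dataclass
class StateCard:
    """A runtime state the agent may need to recognize (hypothetical fields)."""
    state: str       # short description of the state
    evidence: str    # what progress or failure looks like in this state
    next_step: str   # what to do once the state is recognized
    keyframes: list[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A compact package: textual procedure plus state cards with keyframes."""
    name: str
    procedure: list[str]
    state_cards: list[StateCard] = field(default_factory=list)

    def select_cards(self, query: str) -> list[StateCard]:
        # Naive keyword overlap, standing in for whatever selection
        # mechanism a real agent would use (an assumption, not the paper's).
        words = set(query.lower().split())
        return [c for c in self.state_cards
                if words & set(c.state.lower().split())]


# Illustrative data only; the skill and file path are made up.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the dialog"],
    state_cards=[
        StateCard("export dialog open", "a dialog titled Export is visible",
                  "click Save", [Keyframe("frames/export_dialog.png", "full")]),
        StateCard("file menu expanded", "menu items listed under File",
                  "click Export"),
    ],
)
matched = skill.select_cards("dialog is open")
```

In this sketch only the matched cards and their keyframes would be handed to a temporary branch for inspection, which loosely mirrors the stated goal of keeping image context small for the main agent.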