Agent Skills Should Go Beyond Text: The Necessity of Visual Skills

Abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
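To make the interleaved form concrete, the following is a minimal sketch of how ordered text steps might be bound to the visual regions that justify them. The abstract does not specify a schema, so every name here (`VisualRef`, `SkillStep`, `InterleavedVisualSkill`, `grounded_fraction`, `to_text_only`, and the demo values) is a hypothetical illustration rather than the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# NOTE: all types and field names are assumptions made for illustration;
# they are not taken from the paper.

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class VisualRef:
    """Pointer to the visual evidence that justifies a step."""
    frame_id: str  # identifier of a source frame or screenshot
    region: BBox   # region of that frame the step refers to


@dataclass(frozen=True)
class SkillStep:
    """One ordered step: what to do, where to look, how to verify."""
    instruction: str                      # what to do
    evidence: Optional[VisualRef] = None  # where to look (may be absent)
    check: Optional[str] = None           # how to verify the visual outcome


@dataclass
class InterleavedVisualSkill:
    """Ordered text steps interleaved with their visual support."""
    name: str
    steps: List[SkillStep] = field(default_factory=list)

    def grounded_fraction(self) -> float:
        """Share of steps that carry a visual reference."""
        if not self.steps:
            return 0.0
        grounded = sum(1 for s in self.steps if s.evidence is not None)
        return grounded / len(self.steps)

    def to_text_only(self) -> List[str]:
        """Drop all visual support, leaving a text-only skill."""
        return [s.instruction for s in self.steps]


# Hypothetical example: a two-step GUI skill with one grounded step.
demo_skill = InterleavedVisualSkill(
    name="open_settings_panel",
    steps=[
        SkillStep(
            instruction="Click the gear icon in the top-right toolbar.",
            evidence=VisualRef(frame_id="frame_0003", region=(1180, 12, 1212, 44)),
            check="A settings panel is visible on the right side.",
        ),
        SkillStep(instruction="Select the 'Display' tab."),
    ],
)
```

In this sketch, `to_text_only` shows what a text-only skill retains: the instructions survive, while the frame reference and region that indicate where to look are lost.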