Agent Skills Should Go Beyond Text: The Necessity of Visual Skills

Abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
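
To make the notion of an interleaved visual skill more concrete, the following is a minimal sketch of how ordered text steps might be bound to the frames or regions that justify them. It is not the schema used by \SYSTEM; every class, field, and file name here (`VisualRef`, `SkillStep`, `InterleavedVisualSkill`, `frame_003.png`, and so on) is a hypothetical illustration, and the pixel-box convention is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# NOTE: all names below are hypothetical; they illustrate the idea of binding
# text steps to visual evidence and are not taken from the paper's system.


@dataclass(frozen=True)
class VisualRef:
    """Pointer to the visual evidence that supports a step."""
    source: str  # e.g. path to a source frame or screenshot
    # Assumed convention: (x, y, width, height) in pixels; None = whole image.
    region: Optional[Tuple[int, int, int, int]] = None


@dataclass
class SkillStep:
    instruction: str  # what to do
    look_at: List[VisualRef] = field(default_factory=list)  # where to look
    verify: str = ""  # how to check the visual outcome


@dataclass
class InterleavedVisualSkill:
    name: str
    steps: List[SkillStep] = field(default_factory=list)

    def ungrounded_steps(self) -> List[int]:
        """Return indices of steps that carry no visual support."""
        return [i for i, step in enumerate(self.steps) if not step.look_at]


skill = InterleavedVisualSkill(
    name="export_chart",
    steps=[
        SkillStep(
            instruction="Open the File menu",
            look_at=[VisualRef("frame_003.png", (0, 0, 120, 24))],
            verify="A dropdown appears under the menu bar",
        ),
        # This step has no visual reference, so it is effectively text-only.
        SkillStep(instruction="Choose Export"),
    ],
)

print(skill.ungrounded_steps())  # prints [1]
```

Under this sketch, a text-only skill is simply the degenerate case in which every step is ungrounded, which is one way to read the claim that visual skills extend rather than replace textual ones.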