Agent Skills Should Go Beyond Text: The Case for Visual Skills

Abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
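
One way to picture the interleaved visual skills described above is as a small data structure in which each ordered text step carries references to the frames or regions that justify it. The following is only a minimal sketch under stated assumptions: the names `VisualRef`, `SkillStep`, `VisualSkill`, and `interleave`, the bounding-box convention, and the file names are all hypothetical and are not taken from the paper or from \SYSTEM. Dynamic priors are left out because they are described as in-situ working memory rather than stored assets.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical convention: (left, top, right, bottom) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class VisualRef:
    """Pointer to a source frame, screenshot, or page image (illustrative only)."""
    image_path: str
    region: Optional[BBox] = None  # where to look; None means the whole image


@dataclass
class SkillStep:
    """One ordered text step bound to the visual evidence that justifies it."""
    instruction: str                                          # what to do
    evidence: List[VisualRef] = field(default_factory=list)   # where to look
    check: Optional[str] = None                               # how to verify the outcome


@dataclass
class VisualSkill:
    """A reusable skill: static priors plus interleaved text/visual steps."""
    name: str
    static_priors: List[VisualRef] = field(default_factory=list)
    steps: List[SkillStep] = field(default_factory=list)

    def interleave(self) -> List[Tuple[str, object]]:
        """Flatten the skill into an ordered list of ("text", str) and
        ("image", VisualRef) parts, e.g. for building a multimodal prompt."""
        parts: List[Tuple[str, object]] = []
        for ref in self.static_priors:
            parts.append(("image", ref))
        for i, step in enumerate(self.steps, start=1):
            parts.append(("text", f"Step {i}: {step.instruction}"))
            for ref in step.evidence:
                parts.append(("image", ref))
            if step.check:
                parts.append(("text", f"Verify: {step.check}"))
        return parts


# Usage with made-up file names and coordinates.
skill = VisualSkill(
    name="export-chart",
    static_priors=[VisualRef("toolbar.png")],
    steps=[
        SkillStep(
            "Open the File menu.",
            evidence=[VisualRef("frame_03.png", region=(0, 0, 120, 24))],
            check="A dropdown menu is visible.",
        ),
        SkillStep("Click Export."),
    ],
)
parts = skill.interleave()
kinds = [kind for kind, _ in parts]
```

Keeping the evidence attached to each step, instead of in a separate image list, is what lets a consumer of the skill recover where to look and what to verify at the point where the instruction applies. A text-only skill would correspond to the same structure with every `evidence` list empty.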