MMSkills: Towards Multimodal Skills for General Visual Agents

Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) how such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
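
To make the package structure described above concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract states only that an MMSkill couples a textual procedure with runtime state cards and multi-view keyframes, and that a temporary branch distills them into structured guidance. Every class, field, and function name here (`Keyframe`, `StateCard`, `MMSkill`, `select_cards`, `branch_guidance`) is hypothetical, and the word-overlap matching is a stand-in assumption for whatever alignment the actual agent performs against the live environment.

```python
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    image_path: str  # hypothetical: path to a reference screenshot
    view: str        # hypothetical view label, e.g. "full" or "crop"


@dataclass
class StateCard:
    name: str
    description: str  # text description of what the state looks like
    next_step: str    # what to do once the agent is in this state
    keyframes: list[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    name: str
    procedure: list[str]  # the textual procedure, one step per entry
    state_cards: list[StateCard] = field(default_factory=list)


def select_cards(skill, observation, top_k=1):
    """Rank state cards by word overlap with a text description of the live state.

    Assumption: real alignment would compare images; word overlap is a placeholder.
    """
    obs_words = set(observation.lower().split())
    scored = []
    for card in skill.state_cards:
        overlap = len(obs_words & set(card.description.lower().split()))
        if overlap > 0:
            scored.append((overlap, card))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [card for _, card in scored[:top_k]]


def branch_guidance(skill, observation):
    """Inspect selected cards in a temporary branch and return text-only guidance.

    Only this compact dictionary would reach the main agent; the keyframes
    themselves stay in the branch.
    """
    cards = select_cards(skill, observation)
    return {
        "skill": skill.name,
        "matched_states": [c.name for c in cards],
        "suggested_steps": [c.next_step for c in cards],
        "keyframes_inspected": sum(len(c.keyframes) for c in cards),
    }


# Hypothetical example skill; file names and steps are made up for illustration.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the save dialog"],
    state_cards=[
        StateCard(
            "menu_open",
            "file menu is open with export entry visible",
            "Click Export",
            [Keyframe("menu_full.png", "full"), Keyframe("menu_crop.png", "crop")],
        ),
        StateCard(
            "save_dialog",
            "save dialog is shown with filename field",
            "Press Save",
            [Keyframe("dialog_full.png", "full")],
        ),
    ],
)

guidance = branch_guidance(skill, "a save dialog with a filename field")
print(guidance)
```

In this sketch the main agent receives only the short `guidance` dictionary rather than the reference images, which is one way to read the abstract's goal of avoiding excessive image context and over-anchoring to reference screenshots.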