MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, and noisy, and it implicitly assumes a human executor, which makes it difficult to use directly as agent skills. To bridge the gap between human-oriented guides and agent-executable skills, we formalize the problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from agent-observable trajectories. To evaluate how well existing agents perform this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. In experiments spanning GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, with macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, and that both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping additionally prevents late-stage performance regressions and saves 25% to 53% of attempts when the success signal is well calibrated.
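
The closed loop described above (compile a guide into an editable skill, run a fixed agent conditioned on it, revise the skill from trajectory-level feedback, and stop early once success is inferred) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Skill`, `Trajectory`, `Analysis`, `compile_guide`, `revise`, `guide_to_skill_loop`, and the demo agent and analyzer) is hypothetical, and the placeholder logic stands in for components that would involve a VLM in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    """Editable skill: ordered textual steps compiled from a guide (hypothetical structure)."""
    steps: List[str]
    revision: int = 0


@dataclass
class Trajectory:
    """What the agent can observe about one attempt; no benchmark score is stored."""
    observations: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """Analyzer output: an inferred success flag and an optional root-cause note."""
    inferred_success: bool
    root_cause: Optional[str] = None


def compile_guide(guide_text: str) -> Skill:
    # Placeholder compiler (assumption): one step per non-empty guide line.
    return Skill(steps=[ln.strip() for ln in guide_text.splitlines() if ln.strip()])


def revise(skill: Skill, analysis: Analysis) -> Skill:
    # Placeholder reviser (assumption): append a corrective note for the root cause.
    if analysis.root_cause is None:
        return skill
    return Skill(steps=skill.steps + [f"Avoid: {analysis.root_cause}"],
                 revision=skill.revision + 1)


def guide_to_skill_loop(
    guide_text: str,
    run_agent: Callable[[Skill], Trajectory],
    analyze: Callable[[Trajectory], Analysis],
    max_attempts: int = 5,
) -> Tuple[Skill, int]:
    """Run the compile/execute/analyze/revise loop; return the final skill and attempts used."""
    skill = compile_guide(guide_text)
    attempts = 0
    for _ in range(max_attempts):
        trajectory = run_agent(skill)    # fixed agent conditioned on the current skill
        attempts += 1
        analysis = analyze(trajectory)   # uses the trajectory only, never a benchmark score
        if analysis.inferred_success:
            break                        # analyzer-based early stopping
        skill = revise(skill, analysis)
    return skill, attempts


def demo_agent(skill: Skill) -> Trajectory:
    # Toy agent (assumption): succeeds only once the skill contains a corrective note.
    corrected = any(step.startswith("Avoid:") for step in skill.steps)
    return Trajectory(["dark mode enabled" if corrected else "wrong button clicked"])


def demo_analyzer(trajectory: Trajectory) -> Analysis:
    # Toy analyzer (assumption): infers success from the observation text.
    if "dark mode enabled" in trajectory.observations:
        return Analysis(inferred_success=True)
    return Analysis(inferred_success=False, root_cause="clicking the wrong button")


if __name__ == "__main__":
    final_skill, used = guide_to_skill_loop(
        "Open settings\n\nEnable dark mode", demo_agent, demo_analyzer
    )
    print(used, final_skill.steps)
```

In this toy run the first attempt fails, the skill gains one corrective step, and the second attempt is inferred to succeed, so the loop stops before exhausting its attempt budget. The stopping decision relies entirely on the analyzer's inferred success flag, which is why the calibration of that signal matters for how many attempts are saved.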