MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, and noisy, and it implicitly assumes a human executor, which makes it difficult to use directly as agent skills. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate how well existing agents handle this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill outperforms vanilla baseline agents in every model-domain setting, with macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.
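
The closed loop described in the abstract (compile a guide into an editable skill, run a fixed agent conditioned on it, revise the skill from root-cause feedback, and stop early once success is inferred) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Skill`, `Feedback`, `compile_guide`, `revise`, `run_loop`, and the two demo stubs are hypothetical names introduced only for illustration, and the "compiler" and "analyzer" are toy stand-ins for what would be VLM calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    """Hypothetical editable skill: an ordered list of textual steps."""
    steps: List[str]
    revision: int = 0


@dataclass
class Feedback:
    """Hypothetical analyzer output, derived from the trajectory only."""
    success: bool                # inferred success signal (no benchmark score)
    root_cause: Optional[str]    # trajectory-level diagnosis, if any


def compile_guide(guide_text: str) -> Skill:
    # Toy compiler (assumption): each non-empty guide line becomes one step.
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(steps=steps)


def revise(skill: Skill, feedback: Feedback) -> Skill:
    # Toy revision (assumption): append a rule that addresses the root cause.
    if feedback.root_cause is None:
        return skill
    return Skill(
        steps=skill.steps + [f"Avoid: {feedback.root_cause}"],
        revision=skill.revision + 1,
    )


def run_loop(
    guide_text: str,
    execute: Callable[[Skill], List[str]],
    analyze: Callable[[List[str]], Feedback],
    max_attempts: int = 4,
) -> Tuple[Skill, int]:
    """Run attempts until the analyzer infers success or the budget is spent."""
    skill = compile_guide(guide_text)
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        trajectory = execute(skill)      # fixed agent conditioned on the skill
        feedback = analyze(trajectory)   # sees the trajectory, not a score
        if feedback.success:
            break                        # analyzer-based early stopping
        skill = revise(skill, feedback)
    return skill, attempts


def demo_execute(skill: Skill) -> List[str]:
    # Stub agent (assumption): the trajectory simply echoes the skill steps.
    return list(skill.steps)


def demo_analyze(trajectory: List[str]) -> Feedback:
    # Stub analyzer (assumption): succeeds once a corrective rule is present.
    if any(step.startswith("Avoid:") for step in trajectory):
        return Feedback(success=True, root_cause=None)
    return Feedback(success=False, root_cause="clicked the wrong button")


final_skill, used_attempts = run_loop(
    "Open the app\n\nClick save", demo_execute, demo_analyze
)
```

In this toy run the first attempt fails, the skill gains one corrective rule, and the second attempt is judged successful, so the loop stops before exhausting its budget. The early exit only helps if the analyzer's success signal is reliable, which matches the abstract's caveat about calibration.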