MMG2Skill: Can Agents Distill Self-Evolving Skills from In-the-Wild Guides?

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, and noisy, and it implicitly assumes a human executor, which makes it hard to use directly as agent skills. To bridge the gap between human-oriented guides and agent-executable skills, we formalize the problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continually improving those skills from trajectories observable to the agent. To evaluate existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during task execution, and revises the skills using trajectory-level root-cause feedback rather than benchmark scores. Across three domains (GUI control, open-ended gameplay, and strategic card play) and six VLM backbones, MMG2Skill outperforms the vanilla baseline agent in every model-domain setting, with macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that prompting agents directly with raw guides can degrade performance, and that both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25% to 53% of attempts when the success signal is properly calibrated.
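The closed loop described above (compile a guide into skills, run a fixed agent conditioned on them, revise from root-cause feedback, stop early once success is inferred) can be illustrated with a minimal Python sketch. This is not the MMG2Skill implementation: every name here (`Skill`, `Trajectory`, `refine_skills`, and the four callbacks) is hypothetical, and the toy callbacks stand in for the VLM-based components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Skill:
    """Hypothetical editable skill: a name plus ordered steps."""
    name: str
    steps: List[str]


@dataclass
class Trajectory:
    """Hypothetical record of one attempt.

    inferred_success is the analyzer's judgment from signals the agent
    can observe; it is not a benchmark score.
    """
    actions: List[str]
    inferred_success: bool


def refine_skills(
    guide: str,
    compile_guide: Callable[[str], List[Skill]],
    run_agent: Callable[[List[Skill]], Trajectory],
    analyze: Callable[[Trajectory, List[Skill]], str],
    revise: Callable[[List[Skill], str], List[Skill]],
    max_attempts: int = 4,
) -> Tuple[List[Skill], int]:
    """Run the compile / execute / revise loop; return skills and attempts used."""
    skills = compile_guide(guide)
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        trajectory = run_agent(skills)  # the agent itself stays fixed
        if trajectory.inferred_success:
            break  # early stop: avoid revising skills that already work
        root_cause = analyze(trajectory, skills)
        skills = revise(skills, root_cause)
    return skills, attempts


# Toy stand-ins (assumptions for illustration only).
def compile_guide(guide: str) -> List[Skill]:
    steps = [s.strip() for s in guide.split(".") if s.strip()]
    return [Skill("export", steps)]


def run_agent(skills: List[Skill]) -> Trajectory:
    actions = [step for skill in skills for step in skill.steps]
    return Trajectory(actions, inferred_success="click Save" in actions)


def analyze(trajectory: Trajectory, skills: List[Skill]) -> str:
    return "missing step: click Save"


def revise(skills: List[Skill], root_cause: str) -> List[Skill]:
    fix = root_cause.split(": ", 1)[1]
    return [Skill(s.name, s.steps + [fix]) for s in skills]


skills, attempts = refine_skills(
    "Open the File menu. Choose Export.",
    compile_guide, run_agent, analyze, revise,
)
print(attempts, skills[0].steps)
```

In this toy run the first attempt fails, the analyzer names a missing step, the skill is revised, and the second attempt triggers the early stop. The only feedback reaching `revise` is the root-cause string, which mirrors the claim that skills are revised without benchmark scores.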