MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, and noisy, and it implicitly assumes a human executor, which makes it difficult to use directly as agent skills. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, and that both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25% to 53% of attempts when the success signal is properly calibrated.
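The closed loop described in the abstract (compile a guide into an editable skill, run a fixed agent conditioned on it, revise the skill from trajectory-level feedback, and stop early once success is inferred) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (Skill, Analysis, compile_guide, revise, closed_loop, and the run_agent and analyze callables) is a hypothetical placeholder, and the compile and revise steps are trivial stand-ins for what would be model-driven components in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    """Hypothetical editable skill: ordered steps plus notes added by revisions."""
    name: str
    steps: List[str]
    notes: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """Hypothetical analyzer output, inferred from the trajectory alone."""
    inferred_success: bool
    root_cause: Optional[str] = None


def compile_guide(name: str, guide_text: str) -> Skill:
    """Toy compiler (assumption): each non-empty guide line becomes one step."""
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(name=name, steps=steps)


def revise(skill: Skill, analysis: Analysis) -> Skill:
    """Toy revision (assumption): record the root cause as a note on a copy."""
    cause = analysis.root_cause or "unknown"
    return Skill(skill.name, list(skill.steps), skill.notes + [cause])


def closed_loop(
    skill: Skill,
    run_agent: Callable[[Skill], List[str]],
    analyze: Callable[[List[str]], Analysis],
    max_attempts: int = 4,
) -> Tuple[Skill, int]:
    """Run, analyze, and revise until success is inferred or the budget ends.

    No benchmark score is consulted; only the analyzer's reading of the
    trajectory drives revision and early stopping.
    """
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        trajectory = run_agent(skill)      # fixed agent conditioned on the skill
        analysis = analyze(trajectory)     # trajectory-level root-cause feedback
        if analysis.inferred_success:
            break                          # analyzer-based early stopping
        skill = revise(skill, analysis)
    return skill, attempts
```

In this sketch the agent and analyzer are passed in as plain callables so the loop itself stays independent of any particular VLM backbone. The early-stop branch is only as reliable as the analyzer's success signal, which matches the abstract's caveat that savings depend on that signal being properly calibrated.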