MMSkills: Multimodal Skills for General Visual Agents

Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
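As an illustration of the package structure described above, the following is a minimal sketch of how a state-conditioned skill package might be laid out in code. All class names, field names, and the `distill_guidance` helper are hypothetical and chosen for illustration only; they are not the MMSkills implementation or its file format. The sketch assumes only what the abstract states: a textual procedure, runtime state cards, multi-view keyframes, and a step in which selected cards are turned into structured text guidance for the main agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keyframe:
    """One reference view of a state (hypothetical fields)."""
    view: str        # assumed view label, e.g. "full" or "crop"
    image_path: str  # assumed location of the reference screenshot


@dataclass
class StateCard:
    """Describes a recognizable runtime state and what to do in it."""
    state_id: str
    description: str  # visual evidence that identifies the state
    guidance: str     # suggested next action in this state
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """Textual procedure coupled with state cards and keyframes."""
    name: str
    procedure: List[str]
    state_cards: List[StateCard] = field(default_factory=list)

    def select_cards(self, state_ids: List[str]) -> List[StateCard]:
        """Return only the cards whose ids were requested."""
        wanted = set(state_ids)
        return [c for c in self.state_cards if c.state_id in wanted]


def distill_guidance(skill: MMSkill, state_ids: List[str]) -> str:
    """Build text-only guidance from selected cards.

    Keyframe images are deliberately left out of the result, mirroring
    the idea that the main agent receives structured guidance rather
    than raw reference screenshots. This is an assumption of the sketch.
    """
    lines = [f"Skill: {skill.name}"]
    lines += [f"{i}. {step}" for i, step in enumerate(skill.procedure, 1)]
    for card in skill.select_cards(state_ids):
        lines.append(f"[{card.state_id}] {card.description} -> {card.guidance}")
    return "\n".join(lines)
```

In this sketch the keyframes stay attached to the package for inspection elsewhere, while `distill_guidance` returns only text, which is one simple way to keep image context out of the main agent's prompt.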