Agent Skills Should Go Beyond Text: The Case for Visual Skills

Abstract

Reusable skills are a key mechanism for extending agent capabilities: they let agents accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Beyond describing what to do, visual skills encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving the textual reasoning, spatial references, visual boundaries, and interaction patterns found in task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
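
To make the idea of an interleaved visual skill more concrete, the following is a minimal sketch of one way such a skill could be represented as a data structure. It is an illustration only, not the format used by \NAME or \SYSTEM: every class, field, file path, and coordinate below (`SkillForm`, `VisualRef`, `SkillStep`, `VisualSkill`, `to_interleaved`, the example frame path and region) is a hypothetical name chosen for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class SkillForm(Enum):
    """The three reusable forms named in the abstract (enum names are hypothetical)."""
    STATIC_PRIOR = "static_prior"      # stable spatial conventions
    DYNAMIC_PRIOR = "dynamic_prior"    # in-situ visual working memory
    INTERLEAVED = "interleaved"        # ordered text steps bound to visual evidence


@dataclass(frozen=True)
class VisualRef:
    """Pointer to a source frame or screenshot, optionally narrowed to a region."""
    image_path: str
    # Assumed convention: (left, top, right, bottom) in pixels; None means the whole image.
    region: Optional[Tuple[int, int, int, int]] = None


@dataclass
class SkillStep:
    """One ordered step: what to do, where to look, and how to verify the outcome."""
    instruction: str
    evidence: List[VisualRef] = field(default_factory=list)
    verify: Optional[str] = None


@dataclass
class VisualSkill:
    name: str
    form: SkillForm
    steps: List[SkillStep] = field(default_factory=list)

    def to_interleaved(self) -> List[Tuple[str, object]]:
        """Flatten the skill into ordered ("text", str) and ("image", VisualRef) segments."""
        segments: List[Tuple[str, object]] = []
        for index, step in enumerate(self.steps, start=1):
            segments.append(("text", f"Step {index}: {step.instruction}"))
            for ref in step.evidence:
                segments.append(("image", ref))
            if step.verify:
                segments.append(("text", f"Verify: {step.verify}"))
        return segments


# Hypothetical example skill; the path and pixel region are made up for illustration.
skill = VisualSkill(
    name="export_chart_as_png",
    form=SkillForm.INTERLEAVED,
    steps=[
        SkillStep(
            instruction="Open the chart's overflow menu.",
            evidence=[VisualRef("frames/0003.png", (1180, 64, 1212, 96))],
            verify="A dropdown appears.",
        ),
        SkillStep(instruction="Choose the export entry."),
    ],
)
segments = skill.to_interleaved()
```

In this sketch, each text step stays adjacent to the frames that justify it, so a multimodal agent consuming `segments` would see the instruction, then the visual evidence, then the verification hint, in order. A text-only skill corresponds to the degenerate case in which every `evidence` list is empty.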