XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Abstract

Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, which provide concise action-level guidance for tool selection and decision-making, and skills, which provide structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
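The accumulate-then-retrieve loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `KnowledgeItem`, `KnowledgeBank`, `accumulate`, `retrieve`, and `similarity` are illustrative names, the visual context is stood in for by a plain text description, retrieval uses word-overlap (Jaccard) scoring rather than any visually grounded method, and consolidation is reduced to exact-text deduplication instead of summarization and cross-rollout critique.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KnowledgeItem:
    text: str      # the guidance itself
    context: str   # ASSUMPTION: text stand-in for the visual context it came from
    uses: int = 0  # usage count, available to a later accumulation round


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a placeholder for visual-context matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class KnowledgeBank:
    """Hypothetical two-stream store: action-level experiences, task-level skills."""
    experiences: list[KnowledgeItem] = field(default_factory=list)
    skills: list[KnowledgeItem] = field(default_factory=list)

    def _stream(self, name: str) -> list[KnowledgeItem]:
        return self.experiences if name == "experience" else self.skills

    def accumulate(self, stream: str, text: str, context: str) -> None:
        items = self._stream(stream)
        # Simplified consolidation: skip guidance whose text is already stored.
        if all(item.text != text for item in items):
            items.append(KnowledgeItem(text, context))

    def retrieve(self, stream: str, context: str, k: int = 1) -> list[KnowledgeItem]:
        ranked = sorted(
            self._stream(stream),
            key=lambda item: similarity(item.context, context),
            reverse=True,
        )
        top = ranked[:k]
        for item in top:
            item.uses += 1  # record usage so it can be fed back into accumulation
        return top


# Illustrative usage with made-up guidance strings.
bank = KnowledgeBank()
bank.accumulate("experience", "Crop the chart before reading small axis labels.",
                "bar chart with small labels")
bank.accumulate("skill", "Plan: locate region, zoom, read values, then compute.",
                "chart question answering")
hit = bank.retrieve("experience", "line chart with small labels")[0]
```

Keeping the two streams in separate lists mirrors the abstract's distinction between action-level and task-level guidance; a real system would presumably replace `similarity` with a retrieval method that operates on the visual observations themselves.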
