Skill0.5: Combining Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Abstract

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills divide naturally into general skills, which support broad cognitive transfer, and task-specific skills, which support dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates how skills are treated by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 sorts tasks into distinct mastery tiers and applies a tailored optimization strategy to each: on hard tasks it internalizes general skills via privileged distillation to build a cognitive foundation, while on easy tasks it uses diagnostic probing to penalize shortcuts and enforce the use of task-specific skills. Experiments on ALFWorld and WebShop show that Skill0.5 outperforms both memory-based and skill-based RL baselines, with gains in both in-distribution and out-of-distribution scenarios.
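To make the routing idea concrete, the following is a minimal sketch of what a difficulty-aware router could look like. It is not the Skill0.5 implementation: the abstract does not specify how difficulty is measured, so the rolling success rate, the thresholds, the window size, the "medium" tier, the choice to treat unseen tasks as hard, and all names (`DifficultyRouter`, `STRATEGY`) are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DifficultyRouter:
    """Hypothetical router that tiers tasks by recent rollout success rate."""

    hard_below: float = 0.3  # assumed threshold, not taken from the paper
    easy_above: float = 0.8  # assumed threshold, not taken from the paper
    window: int = 16         # assumed number of recent rollouts kept per task
    _history: dict = field(default_factory=dict)

    def record(self, task_id: str, success: bool) -> None:
        """Store the outcome of one rollout for the given task."""
        outcomes = self._history.setdefault(task_id, deque(maxlen=self.window))
        outcomes.append(1.0 if success else 0.0)

    def success_rate(self, task_id: str):
        """Return the rolling success rate, or None if the task is unseen."""
        outcomes = self._history.get(task_id)
        if not outcomes:
            return None
        return sum(outcomes) / len(outcomes)

    def route(self, task_id: str) -> str:
        """Assign a mastery tier; unseen tasks are treated as hard (assumption)."""
        rate = self.success_rate(task_id)
        if rate is None or rate < self.hard_below:
            return "hard"
        if rate > self.easy_above:
            return "easy"
        return "medium"  # intermediate tier is an assumption of this sketch


# Tier-to-strategy mapping paraphrased from the abstract; the "medium"
# entry is a placeholder assumption, not something the abstract describes.
STRATEGY = {
    "hard": "internalize general skills via privileged distillation",
    "easy": "diagnostic probing to penalize shortcuts",
    "medium": "standard RL update",
}

if __name__ == "__main__":
    router = DifficultyRouter(window=4)
    for outcome in (True, True, True, True):
        router.record("pick_and_place", outcome)
    tier = router.route("pick_and_place")
    print(tier, "->", STRATEGY[tier])
```

Because the router reads a rolling window, a task can move between tiers as the policy improves, which is one plausible reading of "dynamic" in the description above.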