Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Abstract

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills fall into two categories: general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 routes tasks into distinct mastery tiers and applies a tailored optimization strategy to each: on hard tasks, it internalizes general skills via privileged distillation to build a cognitive foundation; on easy tasks, it uses diagnostic probing to penalize shortcuts and enforce the use of task-specific skills. Experiments on ALFWorld and WebShop show that Skill0.5 outperforms both memory-based and skill-based RL baselines in both in-distribution and out-of-distribution scenarios.
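The abstract does not say how the difficulty-aware router measures difficulty, so the following is only a minimal sketch of one plausible reading: a task's tier is decided by its rolling success rate over recent rollouts. The class `DifficultyRouter`, the `threshold` and `window` parameters, the rule that unseen tasks count as hard, and the task names are all hypothetical and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    HARD = "hard"  # low success rate: candidate for general-skill internalization
    EASY = "easy"  # high success rate: candidate for shortcut probing


@dataclass
class DifficultyRouter:
    """Hypothetical router that tiers a task by its rolling success rate."""
    threshold: float = 0.5  # assumed cut-off between tiers, not from the paper
    window: int = 8         # assumed number of recent rollouts remembered per task
    history: dict = field(default_factory=dict)

    def record(self, task_id: str, success: bool) -> None:
        # Keep only the most recent `window` outcomes for each task.
        self.history.setdefault(task_id, deque(maxlen=self.window)).append(success)

    def success_rate(self, task_id: str) -> float:
        outcomes = self.history.get(task_id)
        # Assumption: a task with no recorded rollouts is treated as hard.
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def route(self, task_id: str) -> Tier:
        rate = self.success_rate(task_id)
        return Tier.EASY if rate >= self.threshold else Tier.HARD


# Usage with made-up task names and outcomes.
router = DifficultyRouter()
for ok in (True, True, False, True):
    router.record("pick_and_place", ok)
router.record("heat_then_place", False)

easy_tier = router.route("pick_and_place")   # success rate 3/4
hard_tier = router.route("heat_then_place")  # success rate 0/1
```

Because the tier is recomputed from recent outcomes, a task can move from the hard tier to the easy tier as the policy improves, which is one way to read "dynamic" in the abstract. The optimization applied within each tier (privileged distillation, diagnostic probing) is outside the scope of this sketch.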