

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

April 2, 2026
作者: Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI

Abstract

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% on ALFWorld and +6.6% on Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
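The abstract's Dynamic Curriculum can be pictured as a greedy selection under a shrinking token budget: rank skill files by estimated on-policy helpfulness and keep only those that still fit a budget that decays linearly to zero over training. The sketch below illustrates this mechanism; the class and function names, helpfulness scores, and token counts are all illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    tokens: int        # context cost of injecting this skill file
    helpfulness: float # estimated on-policy benefit of including it

def decayed_budget(step: int, total_steps: int, initial_budget: int) -> int:
    """Token budget that decays linearly from initial_budget to 0."""
    frac = max(0.0, 1.0 - step / total_steps)
    return int(initial_budget * frac)

def select_skills(skills, step, total_steps, initial_budget, min_helpfulness=0.0):
    """Greedily keep the most helpful skills that fit the decayed budget."""
    budget = decayed_budget(step, total_steps, initial_budget)
    kept, used = [], 0
    for s in sorted(skills, key=lambda s: s.helpfulness, reverse=True):
        if s.helpfulness <= min_helpfulness:
            break  # the policy no longer benefits from this (or weaker) skills
        if used + s.tokens <= budget:
            kept.append(s)
            used += s.tokens
    return kept
```

Early in training the full budget admits most helpful skills; as the budget decays, fewer skills fit, and at the end the context is empty, forcing fully zero-shot behavior, which matches the curriculum described above.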