

In-Context Reinforcement Learning for Tool Use in Large Language Models

March 9, 2026
Authors: Yaoqi Ye, Yiran Zhao, Keyu Duan, Zeyu Zheng, Kenji Kawaguchi, Cihang Xie, Michael Qizhe Shieh
cs.AI

Abstract

While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
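The staged transition from few-shot to zero-shot rollout prompts can be sketched as follows. This is a minimal illustration only: the linear decay schedule, the function names, and the demonstration format are all assumptions for exposition, not the paper's actual implementation.

```python
import random

def num_examples(step, total_steps, max_examples=4):
    """Linearly decay the number of in-context examples from
    max_examples at step 0 down to zero by the end of training."""
    frac = 1.0 - step / total_steps
    return round(max_examples * frac)

def build_rollout_prompt(question, example_pool, step, total_steps):
    """Prepend k tool-use demonstrations to the rollout prompt,
    where k shrinks as RL training progresses."""
    k = num_examples(step, total_steps)
    shots = random.sample(example_pool, k)
    demos = "\n\n".join(shots)
    return (demos + "\n\n" if demos else "") + question

# Hypothetical pool of tool-call demonstrations (format is illustrative).
pool = [
    "Q: 17*23? <tool>python</tool> print(17*23) -> 391. A: 391",
    "Q: Capital of France? <tool>search</tool> query -> Paris. A: Paris",
    "Q: sqrt(144)? <tool>python</tool> print(144**0.5) -> 12.0. A: 12",
    "Q: Who wrote Hamlet? <tool>search</tool> query -> Shakespeare. A: Shakespeare",
]
```

By the final training step `num_examples` returns 0, so the rollout prompt contains only the question and the model must invoke tools without any demonstrations, matching the zero-shot setting described above.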