

In-Context Reinforcement Learning for Tool Use in Large Language Models

March 9, 2026
作者: Yaoqi Ye, Yiran Zhao, Keyu Duan, Zeyu Zheng, Kenji Kawaguchi, Cihang Xie, Michael Qizhe Shieh
cs.AI

Abstract

While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
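The core mechanism the abstract describes, injecting few-shot tool-use examples into rollout prompts and gradually reducing their number until the model acts zero-shot, can be sketched as follows. This is a minimal, hypothetical illustration, assuming a linear decay schedule; the example demonstrations, tag format, and function names are illustrative and not taken from the paper's implementation.

```python
# Hypothetical sketch of ICRL-style rollout prompting: the number of
# in-context tool-use examples decays over training until zero-shot.
# Demonstration format and decay schedule are assumptions, not the paper's.

TOOL_USE_EXAMPLES = [
    "Q: What is 17 * 23?\n"
    "<tool>python</tool> print(17 * 23) <result>391</result>\nA: 391",
    "Q: Who wrote 'Dune'?\n"
    "<tool>search</tool> author of Dune <result>Frank Herbert</result>\n"
    "A: Frank Herbert",
]

def num_shots(step: int, total_steps: int, max_shots: int) -> int:
    """Linearly decay the number of in-context examples to zero."""
    frac = 1.0 - step / total_steps
    return round(max_shots * frac)

def build_rollout_prompt(question: str, step: int, total_steps: int) -> str:
    """Prepend the scheduled number of demonstrations to the query."""
    k = num_shots(step, total_steps, len(TOOL_USE_EXAMPLES))
    return "\n\n".join(TOOL_USE_EXAMPLES[:k] + [f"Q: {question}"])
```

Early in training the prompt carries all demonstrations of the tool-call syntax; by the final step `build_rollout_prompt` emits only the bare question, matching the zero-shot setting the abstract describes. Any monotone schedule (stepwise, cosine) could replace the linear one.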
PDF · March 13, 2026