In-Context Reinforcement Learning for Tool Use in Large Language Models

Abstract

While large language models (LLMs) exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limits of their internal knowledge. A compelling way to overcome this is to augment these models with external tools, such as Python interpreters for mathematical computation or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL places in-context examples in the rollout prompts to teach the model how to invoke external tools. As training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting in which the model calls tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
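The idea of shrinking the number of in-context examples during training can be illustrated with a minimal sketch. The abstract does not say how the count is reduced, so the staircase schedule below is an assumption, and the names `Demo`, `num_shots`, and `build_rollout_prompt` are hypothetical helpers rather than the authors' implementation. The prompt layout is likewise only illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """Hypothetical in-context example: a question plus a worked tool-calling trace."""
    question: str
    trace: str


def num_shots(step: int, total_steps: int, max_shots: int) -> int:
    """Assumed staircase schedule: start at max_shots, end at zero shots.

    Training is split into (max_shots + 1) equal stages, and each stage
    uses one fewer demonstration than the previous one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    stages = max_shots + 1  # one stage per shot count, including zero-shot
    stage = min(step * stages // total_steps, max_shots)
    return max_shots - stage


def build_rollout_prompt(question: str, demos: List[Demo], k: int) -> str:
    """Prepend the first k demonstrations; k == 0 yields a zero-shot prompt."""
    parts = [f"Question: {d.question}\n{d.trace}" for d in demos[:k]]
    parts.append(f"Question: {question}\n")
    return "\n\n".join(parts)
```

In an RL loop, `num_shots(step, total_steps, max_shots)` would be evaluated before each rollout batch and its result passed as `k` to `build_rollout_prompt`, so early rollouts see worked tool calls while late rollouts see only the question.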
