PORTool: Tool-Use LLM Training with Rewarded Tree

Abstract

Current tool-use large language models (LLMs) are trained on static datasets, which lets them interact with external tools and perform multi-step, tool-integrated reasoning that produces tool-call trajectories. However, these models only imitate how a query is resolved in a generic tool-call routine; they fail to explore alternative solutions and show limited performance in an evolving, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories that yield the correct answer. Specifically, the method first generates multiple rollouts for a given query; some rollouts share their first few tool-call steps, so together they form a tree-like structure. Next, we assign a reward to each step based on its ability to produce a correct answer and to make successful tool calls. A step shared across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to compute fork-relative advantages, which are blended with trajectory-relative advantages to train the LLM for tool use. The experiments use 17 tools to address user queries covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare PORTool with other training approaches and demonstrate significant improvements in final accuracy and in the number of tool-call steps.
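
The tree-structured reward and advantage scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no formulas, so the step reward (mean final outcome of the rollouts passing through a step, plus a small bonus for a successful tool call), the within-group mean/std normalization, the mixing weight `alpha`, and all function and tool names are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Center and scale values within one group (an assumed normalization)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def step_rewards(trajectories, outcomes, call_ok, bonus=0.1):
    """Reward each tree node (a step prefix) once, so shared steps agree.

    trajectories: list of step-id lists; equal prefixes are shared steps.
    outcomes: final reward per trajectory (e.g. 1.0 for a correct answer).
    call_ok: maps a prefix tuple to whether its last tool call succeeded;
             missing entries are treated as successful.
    """
    through = defaultdict(list)
    for traj, outcome in zip(trajectories, outcomes):
        for t in range(1, len(traj) + 1):
            through[tuple(traj[:t])].append(outcome)
    return {
        node: mean(vals) + (bonus if call_ok.get(node, True) else 0.0)
        for node, vals in through.items()
    }


def blended_advantages(trajectories, outcomes, rewards, alpha=0.5):
    """Mix fork-relative and trajectory-relative advantages for every step."""
    traj_adv = normalize(outcomes)

    # Sibling steps share a parent prefix; that parent is the fork.
    children = defaultdict(list)
    for node in sorted(rewards):
        children[node[:-1]].append(node)

    # Compare each step only against its siblings under the same fork.
    fork_adv = {}
    for siblings in children.values():
        scores = normalize([rewards[s] for s in siblings])
        fork_adv.update(zip(siblings, scores))

    return [
        [
            alpha * fork_adv[tuple(traj[:t])] + (1 - alpha) * traj_adv[i]
            for t in range(1, len(traj) + 1)
        ]
        for i, traj in enumerate(trajectories)
    ]


# Three hypothetical rollouts: all share "search", two also share "weather".
trajectories = [
    ["search", "weather", "answer_a"],
    ["search", "weather", "answer_b"],
    ["search", "calc", "answer_c"],
]
outcomes = [1.0, 0.0, 0.0]               # only the first rollout is correct
call_ok = {("search", "calc"): False}    # this tool call failed

rewards = step_rewards(trajectories, outcomes, call_ok)
advantages = blended_advantages(trajectories, outcomes, rewards)
fork_only = blended_advantages(trajectories, outcomes, rewards, alpha=1.0)
```

In this sketch a shared step is stored under a single key, so it cannot receive two different rewards, while sibling steps under one fork are scored against each other. Setting `alpha=1.0` isolates the fork-relative term, which is zero for the shared first step because it has no siblings.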
