Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning

Abstract

Large language models (LLMs) have achieved remarkable progress in complex reasoning tasks, yet they remain fundamentally limited by their reliance on static internal knowledge and text-only reasoning. Real-world problem solving often demands dynamic, multi-step reasoning, adaptive decision making, and the ability to interact with external tools and environments. In this work, we introduce ARTIST (Agentic Reasoning and Tool Integration in Self-improving Transformers), a unified framework that tightly couples agentic reasoning, reinforcement learning, and tool integration for LLMs. ARTIST enables models to autonomously decide when, how, and which tools to invoke within multi-turn reasoning chains, leveraging outcome-based RL to learn robust strategies for tool use and environment interaction without requiring step-level supervision. Extensive experiments on mathematical reasoning and multi-turn function calling benchmarks show that ARTIST consistently outperforms state-of-the-art baselines, with up to 22% absolute improvement over base models and strong gains on the most challenging tasks. Detailed studies and metric analyses reveal that agentic RL training leads to deeper reasoning, more effective tool use, and higher-quality solutions. Our results establish agentic RL with tool integration as a powerful new frontier for robust, interpretable, and generalizable problem-solving in LLMs.
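The interaction pattern described above, in which a model interleaves reasoning with tool calls across several turns and is scored only on the final outcome, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the ARTIST implementation: the `<tool>`, `<output>`, and `<answer>` tag format, the `rollout` and `outcome_reward` helpers, the exact-match reward, and the scripted stand-in policy are all hypothetical. The RL policy update itself is omitted; the sketch only shows how a rollout could be collected and how it could receive a single outcome-level reward with no step-level labels.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical tag format; the abstract does not specify one.
TOOL_CALL = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(policy: Callable[[str], str],
            tools: Dict[str, Callable[[str], str]],
            prompt: str,
            max_turns: int = 4) -> Tuple[str, str]:
    """Alternate model turns and tool executions until an answer appears.

    Returns (full transcript, final answer), with an empty answer string
    if the turn budget runs out first.
    """
    context = prompt
    for _ in range(max_turns):
        turn = policy(context)  # the model decides whether to call a tool
        context += turn
        answer = ANSWER.search(turn)
        if answer:
            return context, answer.group(1).strip()
        call = TOOL_CALL.search(turn)
        if call:
            name, arg = call.group(1), call.group(2)
            tool = tools.get(name)
            output = tool(arg) if tool else f"unknown tool: {name}"
            # Tool output is fed back into the context for the next turn.
            context += f"<output>{output}</output>"
    return context, ""


def outcome_reward(answer: str, reference: str) -> float:
    """Outcome-only reward: score the final answer, not individual steps."""
    return 1.0 if answer == reference.strip() else 0.0


# Scripted stand-in for a learned policy (assumption, for demonstration only).
def scripted_policy(context: str) -> str:
    if "<output>" not in context:
        return '<tool name="add">17 29</tool>'
    result = re.findall(r"<output>(.*?)</output>", context)[-1]
    return f"<answer>{result}</answer>"


def add_tool(arg: str) -> str:
    """Toy tool: sum whitespace-separated integers."""
    return str(sum(int(x) for x in arg.split()))


transcript, answer = rollout(scripted_policy, {"add": add_tool},
                             "What is 17 + 29? ")
reward = outcome_reward(answer, "46")
```

In a training setup along these lines, the scalar `reward` would be the only learning signal attached to the whole transcript, which is what makes step-level supervision unnecessary.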