The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Abstract

The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM-RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov decision processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

