
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

February 25, 2026
Authors: Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
cs.AI

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean, standardized testbed; we then decompose the policy gradient into four core design dimensions and assess the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy-gradient perspective on ARL and offers practical guidance for building stable, reproducible LLM-based agent training pipelines.
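The abstract does not spell out the four design dimensions or SAMPO's update rule, so the following is background only: the standard policy-gradient estimator for multi-turn (agentic) rollouts that such a decomposition would start from, written as a generic sketch rather than the paper's formulation:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\Big]

Stability-oriented variants typically constrain how far each update moves the policy; the PPO-style clipped surrogate is the most common reference point (whether SAMPO uses this mechanism is not stated in the abstract):

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

In an estimator of this shape, the natural design axes are how trajectories \tau are sampled, how the advantage \hat{A}_t is estimated, how credit is assigned across turns and tokens, and how each update is clipped or regularized. These are plausible readings of the paper's four dimensions, but the mapping is an assumption, not something the abstract confirms.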