ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
February 25, 2026
作者: Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
cs.AI
Abstract
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework for examining training stability in a controlled and reproducible setting. ARLArena first constructs a clean, standardized testbed; we then decompose the policy gradient into four core design dimensions and assess the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective on ARL and offers practical guidance for building stable, reproducible LLM-based agent training pipelines.
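The abstract does not enumerate the four design dimensions or SAMPO's update rule. For orientation only, the sketch below shows a generic token-level clipped policy-gradient objective of the kind ARL training recipes typically build on, and the knobs such an objective exposes: the advantage estimate, the importance-ratio clipping width, token masking and loss aggregation, and KL regularization against a reference policy. All names and choices here are illustrative assumptions, not ARLArena's or SAMPO's implementation.

```python
# A minimal sketch, NOT the paper's method: a generic token-level
# clipped policy-gradient objective for LLM agent RL. The knobs below
# (advantage source, ratio clipping, token masking/aggregation, KL
# regularization) are assumed illustrative axes, not the paper's
# four dimensions.
import torch


def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average x over positions where mask == 1 (agent-generated tokens)."""
    return (x * mask).sum() / mask.sum().clamp(min=1.0)


def clipped_pg_loss(
    logp_new: torch.Tensor,    # (B, T) log-probs under the current policy
    logp_old: torch.Tensor,    # (B, T) log-probs under the rollout policy
    logp_ref: torch.Tensor,    # (B, T) log-probs under a frozen reference
    advantages: torch.Tensor,  # (B, T) per-token advantage estimates
    mask: torch.Tensor,        # (B, T) 1 for agent tokens, 0 for env/tool tokens
    clip_eps: float = 0.2,
    kl_coef: float = 0.01,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)  # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style clipped surrogate (negated for gradient descent)
    surrogate = -torch.min(ratio * advantages, clipped * advantages)
    # Common "k3" estimator of KL(pi_new || pi_ref) on sampled tokens
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0
    return masked_mean(surrogate + kl_coef * kl, mask)
```

Each term corresponds to a choice a recipe can vary independently; in the agentic setting, masking environment and tool-output tokens out of the loss is one such choice that is widely reported to affect stability.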