ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often ends in training collapse. This instability limits scalability to larger environments and longer interaction horizons, and it hinders systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean, standardized testbed. We then decompose the policy gradient into four core design dimensions and assess the performance and stability of each. From this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

