OpenClaw-RL: Train Any Agent Simply by Talking

Abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Because the design is asynchronous, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
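
The abstract does not spell out how the token-level directional advantage is computed. One plausible reading, sketched below as an assumption and not as the authors' implementation, is that the same policy is scored twice on the tokens it actually produced: once under the original context (student) and once under a context enriched with a hint drawn from the next state (teacher). The per-token advantage is then the difference of the two log-probabilities. The names `TokenScore`, `build_teacher_context`, `directional_advantages`, and the hint template are hypothetical, and the log-probabilities are made-up numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenScore:
    token: str
    student_logprob: float  # log-prob of the token given the original context
    teacher_logprob: float  # log-prob of the token given context plus hint


def build_teacher_context(context: str, hint: str) -> str:
    """Append a hindsight hint to the original prompt (hypothetical template)."""
    return f"{context}\n[Hint from next state] {hint}"


def directional_advantages(scores: List[TokenScore]) -> List[float]:
    """Assumed form: teacher log-prob minus student log-prob, per token.

    Positive values mark tokens the hint-aware teacher prefers more than
    the student did; negative values mark tokens it would have avoided.
    """
    return [s.teacher_logprob - s.student_logprob for s in scores]


# Illustrative values only; a real system would obtain the log-probs
# from two forward passes of the policy model.
scores = [
    TokenScore("ls", student_logprob=-1.0, teacher_logprob=-0.5),
    TokenScore("-z", student_logprob=-0.5, teacher_logprob=-2.0),
]
advantages = directional_advantages(scores)
```

Unlike a single scalar reward for the whole action, this gives one signed value per token, which is the sense in which the supervision is directional.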
