
OpenClaw-RL: Train Any Agent Simply by Talking

March 10, 2026
Authors: Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
cs.AI

Abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems; they are all interactions that can train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy, all at the same time and with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
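To make the two signal types concrete, the following is a minimal sketch of how an evaluative (scalar-reward) signal and a directive (hindsight) signal might be recovered from a single next-state observation. All names here (`prm_judge`, `extract_hint`, `build_teacher_context`, `distill_advantages`) are hypothetical illustrations for this summary, not the OpenClaw-RL API, and the PRM judge is a trivial stand-in for an actual learned process reward model.

```python
def prm_judge(action: str, next_state: str) -> float:
    """Evaluative signal: a toy stand-in for a PRM that scores
    how well the action performed, based on the next state."""
    return 1.0 if "error" not in next_state.lower() else 0.0

def extract_hint(next_state: str) -> str:
    """Directive signal: recover a textual hint from the next state
    (e.g. a user correction or a tool error message)."""
    return f"Hint from environment feedback: {next_state}"

def build_teacher_context(prompt: str, next_state: str) -> str:
    """Hindsight-enhanced teacher context: the original prompt plus
    the extracted hint, giving the teacher information the student
    did not have when acting."""
    return f"{prompt}\n{extract_hint(next_state)}"

def distill_advantages(teacher_logprobs, student_logprobs):
    """Token-level directional advantages for on-policy distillation:
    positive where the hint-conditioned teacher assigns higher
    log-probability to the sampled token than the student did."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]
```

For example, `prm_judge("rm notes.txt", "Error: permission denied")` yields a reward of `0.0`, while `distill_advantages([-1.0, -2.0], [-1.5, -1.5])` yields `[0.5, -0.5]`: the first token is reinforced, the second discouraged, which is the token-level directional supervision the abstract contrasts with a single scalar reward.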