ChatPaper.ai


OpenClaw-RL: Train Any Agent Simply by Talking

March 10, 2026
Authors: Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
cs.AI

Abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Thanks to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
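The abstract describes splitting each next-state signal into an evaluative part (a scalar reward from a PRM judge) and a directive part (a textual hint folded into an enhanced teacher context). A minimal sketch of that split, with a keyword-based stub standing in for the actual PRM judge and hint extractor (all names here are illustrative, not from the released code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    action: str       # what the agent did
    next_state: str   # the reply / tool output / state change that followed


def prm_judge(t: Transition) -> float:
    """Evaluative signal: map the next-state text to a scalar reward.
    A real PRM judge would be an LLM scoring the interaction; this stub
    just flags obvious failure language."""
    text = t.next_state.lower()
    return 0.0 if ("error" in text or "wrong" in text) else 1.0


def extract_hint(t: Transition) -> Optional[str]:
    """Directive signal: recover a textual hint from the next state,
    e.g. a user correction, for hindsight-guided distillation."""
    if "instead" in t.next_state.lower():
        return t.next_state
    return None


def teacher_context(task: str, t: Transition) -> str:
    """Enhanced teacher context: the teacher conditions on the task plus
    the hint the student only discovered after acting, so its token-level
    outputs carry directional supervision a scalar reward cannot."""
    hint = extract_hint(t)
    ctx = f"Task: {task}\n"
    if hint is not None:
        ctx += f"Hint: {hint}\n"
    return ctx
```

In the full method the teacher's token distribution under this enriched context is distilled back into the on-policy student; the stub only illustrates where the two signals come from.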
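The asynchronous design has three roles running concurrently: the model serving live requests, the PRM judging finished interactions, and the trainer consuming judged experience. A toy sketch of that decoupling with threads and queues (the role functions and reward are placeholders, not the framework's API):

```python
import queue
import threading

interactions = queue.Queue()   # server -> judge
experiences = queue.Queue()    # judge  -> trainer


def server(requests):
    """Serve live requests; each interaction is logged for judging."""
    for req in requests:
        response = f"response to {req}"      # stand-in for model inference
        interactions.put((req, response))
    interactions.put(None)                   # signal end of stream


def judge():
    """Score each logged interaction with a (stubbed) PRM reward."""
    while (item := interactions.get()) is not None:
        req, resp = item
        experiences.put((req, resp, 1.0))    # stub scalar reward
    experiences.put(None)


def trainer(updates):
    """Consume judged experience; a real trainer would take a policy step."""
    while (exp := experiences.get()) is not None:
        updates.append(exp)


def run(requests):
    updates = []
    threads = [
        threading.Thread(target=server, args=(requests,)),
        threading.Thread(target=judge),
        threading.Thread(target=trainer, args=(updates,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return updates
```

Because the three stages only share queues, none of them blocks on another's step boundary, which is the "zero coordination overhead" property the abstract claims.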