OpenClaw-RL: Train Any Agent Simply by Talking

Abstract

Every agent interaction generates a next-state signal: the user reply, tool output, or terminal or GUI state change that follows each action. Yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a single policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems; they are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). In OPD, we extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Thanks to the asynchronous design, the model serves live requests, the PRM judge scores ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
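To make the contrast between the two signal types concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state how the token-level directional advantage is computed, so the sketch assumes one common choice from on-policy distillation: the per-token log-probability gap between a teacher that sees the hint-enhanced context and the student that acted without it. The function names `scalar_advantages` and `directional_advantages`, and all numbers in the demo, are hypothetical.

```python
import math
from typing import List


def scalar_advantages(reward: float, num_tokens: int) -> List[float]:
    """Evaluative signal: one judge score shared by every token of the action."""
    return [reward] * num_tokens


def directional_advantages(
    student_logprobs: List[float],
    teacher_logprobs: List[float],
) -> List[float]:
    """Directive signal (ASSUMED form): teacher minus student log-prob per token.

    student_logprobs: log-probs of the sampled tokens under the acting policy.
    teacher_logprobs: log-probs of the same tokens when the context is
        augmented with a textual hint extracted from the next state.
    A negative entry marks a token the hint-aware teacher finds less likely
    than the student did; a positive entry marks one it finds more likely.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


if __name__ == "__main__":
    # Hypothetical three-token action and made-up probabilities.
    tokens = ["rm", "-rf", "build"]
    student = [math.log(0.5), math.log(0.25), math.log(0.5)]
    teacher = [math.log(0.5), math.log(0.125), math.log(0.125)]

    # A scalar reward treats every token the same way.
    print(list(zip(tokens, scalar_advantages(-1.0, len(tokens)))))
    # The per-token gap singles out which tokens the hint argues against.
    print(list(zip(tokens, directional_advantages(student, teacher))))
```

Under this assumed form, the scalar reward assigns the same credit to every token of the action, while the per-token gap is zero for tokens the hint leaves unchanged and nonzero only where the hint shifts the teacher's preference. This is one way to read the claim that the directive signal is richer than a scalar reward.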
