Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Abstract

When pretrained vision-language-action (VLA) policies are fine-tuned with online reinforcement learning (RL), each rollout episode yields only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which raises two problems. First, a single scalar conflates two objectives, viability and efficiency: once basic success is achievable, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments, and naively assigning the episode outcome across these boundaries produces incorrect credit assignment. To address both problems, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting toward efficiency only when viability is high; the merged advantage is then converted into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
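
To make the gating and labeling ideas concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the functional form of the gate g_t, the mapping from advantage to weight, or the labeling rule, so everything here is an assumption chosen for illustration: a sigmoid gate on an estimated viability value, a convex blend of the two advantages, a clipped exponential weight (in the style of advantage-weighted regression), and the names `gate`, `transition_weights`, `label_policy_transitions`, `threshold`, `sharpness`, `temperature`, and `max_weight`.

```python
import math


def gate(viability, threshold=0.8, sharpness=10.0):
    """Hypothetical state-adaptive gate g_t in (0, 1).

    Close to 0 when estimated viability is low (success uncertain),
    close to 1 once viability is well above the assumed threshold.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def transition_weights(adv_viability, adv_efficiency, viability,
                       temperature=1.0, max_weight=20.0):
    """Blend two one-step advantages and map them to actor-loss weights.

    Assumed form: A_t = (1 - g_t) * A_viability + g_t * A_efficiency,
    followed by w_t = min(exp(A_t / temperature), max_weight).
    """
    log_cap = math.log(max_weight)
    weights = []
    for a_v, a_e, v in zip(adv_viability, adv_efficiency, viability):
        g = gate(v)
        merged = (1.0 - g) * a_v + g * a_e
        # Clip in log space so very large advantages cannot overflow exp().
        weights.append(math.exp(min(merged / temperature, log_cap)))
    return weights


def label_policy_transitions(executed_by_policy, success):
    """Assign the episode outcome only to policy-executed transitions.

    Transitions from intervention segments receive no outcome label (None).
    """
    outcome = 1 if success else 0
    return [outcome if by_policy else None for by_policy in executed_by_policy]


if __name__ == "__main__":
    # Same advantages, different viability: the gate decides which one dominates.
    print(transition_weights([1.0, 1.0], [-1.0, -1.0], [0.2, 0.95]))
    print(label_policy_transitions([True, True, False, True], success=True))
```

In this sketch, a transition with low estimated viability is weighted almost entirely by the viability advantage, while the efficiency advantage only takes over once viability is high, which mirrors the prioritization described above. The resulting weights would multiply the per-transition behavior cloning loss.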