Hierarchical Advantage Weighting for Online Reinforcement Learning Fine-Tuning of VLAs from Sparse Episode Outcomes

Abstract

When pretrained vision-language-action (VLA) policies are fine-tuned through online reinforcement learning (RL), each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which blurs distinct forms of transition-level feedback and offers little guidance once basic task success becomes achievable. First, a single scalar signal conflates two objectives, viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning the episode outcome across these boundaries produces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
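
The gating and labeling ideas described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the functional forms, so the sigmoid gate, the exponential (advantage-weighted-regression-style) weight, the clipping constant, and the rule of labeling only the trailing run of autonomous steps are all assumptions made for illustration. The names `gate`, `transition_weight`, `outcome_labels`, `threshold`, `sharpness`, `beta`, and `w_max` are hypothetical.

```python
import math


def gate(p_viable, threshold=0.8, sharpness=20.0):
    """Hypothetical gate g_t in (0, 1).

    Stays near 0 while the estimated success probability is low and
    approaches 1 once it exceeds `threshold` (sigmoid form is assumed).
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (p_viable - threshold)))


def transition_weight(adv_viability, adv_efficiency, p_viable,
                      beta=1.0, w_max=20.0):
    """Blend two one-step advantages and map the result to a loss weight.

    The exponential mapping and the clip at `w_max` are assumptions
    borrowed from advantage-weighted regression, not from the source.
    """
    g = gate(p_viable)
    advantage = (1.0 - g) * adv_viability + g * adv_efficiency
    return min(math.exp(advantage / beta), w_max)


def outcome_labels(is_autonomous, success):
    """One possible reading of intervention-aware credit assignment.

    Only the trailing contiguous run of policy-executed steps receives
    the episode outcome; every earlier step stays unlabeled (None), so
    the label never crosses an intervention boundary.
    """
    labels = [None] * len(is_autonomous)
    t = len(is_autonomous) - 1
    while t >= 0 and is_autonomous[t]:
        labels[t] = 1.0 if success else 0.0
        t -= 1
    return labels


if __name__ == "__main__":
    # Same advantages, different viability estimates: the viability head
    # dominates when success is uncertain, the efficiency head otherwise.
    print(transition_weight(0.5, -0.5, p_viable=0.3))
    print(transition_weight(0.5, -0.5, p_viable=0.99))
    # Steps 1 and 2 are human interventions in a successful episode.
    print(outcome_labels([True, False, False, True, True], success=True))
```

With these assumed forms, a transition that helps viability but hurts efficiency is up-weighted while success is still uncertain and down-weighted once viability is high, which matches the prioritization the abstract describes. A smooth gate is used here so the weight changes continuously with the viability estimate; a hard threshold would be an equally valid illustration.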