Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLA Models from Sparse Episode Outcomes

Abstract

When pretrained vision-language-action (VLA) policies are fine-tuned through online reinforcement learning (RL), each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which causes two problems. First, a single scalar conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate g_t merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
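
To make the gating idea concrete, the following is a minimal sketch of how two one-step advantages could be merged by a state-adaptive gate and turned into per-transition weights. The abstract does not give the functional forms, so everything specific here is an assumption for illustration: the sigmoid gate on a predicted success probability, the `threshold` and `sharpness` parameters, the clipped exponential weighting, and the neutral weight of 1.0 for transitions not executed by the current policy. The function name `habc_style_weights` is hypothetical.

```python
import numpy as np


def habc_style_weights(adv_viability, adv_efficiency, p_success, policy_executed,
                       threshold=0.8, sharpness=10.0, temperature=1.0,
                       max_weight=10.0):
    """Illustrative gate-and-weight computation (assumed forms, not the paper's).

    adv_viability, adv_efficiency: one-step advantages from the two critic heads.
    p_success: predicted success probability per transition (assumed gate input).
    policy_executed: True where the current policy, not a human, produced the step.
    """
    adv_viability = np.asarray(adv_viability, dtype=float)
    adv_efficiency = np.asarray(adv_efficiency, dtype=float)
    p_success = np.asarray(p_success, dtype=float)
    policy_executed = np.asarray(policy_executed, dtype=bool)

    # Assumed gate g_t: close to 0 when success is uncertain (favor viability),
    # close to 1 when viability is high (favor efficiency).
    gate = 1.0 / (1.0 + np.exp(-sharpness * (p_success - threshold)))

    # Convex combination of the two one-step advantages.
    merged = (1.0 - gate) * adv_viability + gate * adv_efficiency

    # Assumed conversion to positive weights: clipped exponential, as in
    # common advantage-weighted regression setups.
    weights = np.minimum(np.exp(merged / temperature), max_weight)

    # Assumption: steps outside policy-executed segments carry no outcome-based
    # signal in this sketch, so they fall back to a neutral weight of 1.0.
    weights = np.where(policy_executed, weights, 1.0)
    return gate, weights


if __name__ == "__main__":
    gate, weights = habc_style_weights(
        adv_viability=[1.0, 0.0, 2.0],
        adv_efficiency=[-1.0, 0.5, 2.0],
        p_success=[0.1, 0.95, 0.95],
        policy_executed=[True, True, False],
    )
    print("gate:", gate)
    print("weights:", weights)
```

In this sketch the weights would multiply the per-transition behavior cloning loss of the actor. With a low predicted success probability the merged advantage is dominated by the viability head, and the efficiency head only takes over once the gate opens.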