Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

Abstract

When a pretrained vision-language-action (VLA) policy is fine-tuned with online reinforcement learning (RL), each rollout episode yields only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which raises two problems. First, a single scalar conflates two objectives, viability and efficiency: once basic success is achievable, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments, and naively propagating the episode outcome across these boundaries misassigns credit. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate \( g_t \) merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
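The gating and credit-assignment ideas can be illustrated with a minimal sketch. The abstract does not give the functional forms, so everything specific here is an assumption made for illustration: the sigmoid gate, the `threshold` and `sharpness` parameters, the exponential weighting with temperature `beta`, the clipping at `max_weight`, and all function names are hypothetical and are not taken from HABC itself.

```python
import math


def gate(viability, threshold=0.8, sharpness=10.0):
    """Hypothetical gate g_t in (0, 1).

    Assumed form: a sigmoid of the estimated viability, so g_t is near 0
    when success is uncertain and near 1 only when viability is high.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def transition_weight(adv_viability, adv_efficiency, viability,
                      beta=1.0, max_weight=20.0):
    """Merge two one-step advantages into one per-transition actor weight.

    Assumed form: a convex combination controlled by g_t, followed by an
    exponential advantage weighting that is clipped for stability.
    """
    g = gate(viability)
    combined = (1.0 - g) * adv_viability + g * adv_efficiency
    return min(math.exp(combined / beta), max_weight)


def mask_outcome_labels(is_policy_step, outcome):
    """Keep the episode outcome only on steps executed by the current policy.

    Steps taken during an intervention get None, meaning "no outcome
    label", so supervision does not cross intervention boundaries.
    """
    return [outcome if by_policy else None for by_policy in is_policy_step]


if __name__ == "__main__":
    # Low viability: the viability advantage dominates the weight.
    print(transition_weight(1.0, -1.0, viability=0.2))
    # High viability: the efficiency advantage dominates the weight.
    print(transition_weight(1.0, -1.0, viability=1.0))
    # Only policy-executed steps receive the success label.
    print(mask_outcome_labels([True, False, True], outcome=1))
```

Under these assumptions, a transition with a positive viability advantage but a negative efficiency advantage is up-weighted while success is still uncertain and down-weighted once viability is high, which mirrors the prioritization described above.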