Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Abstract

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow-matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate (SR) on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that a discrete-diffusion action decoder supports precise action modeling and consistent training, laying the groundwork for scaling VLA to larger models and datasets.
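
The decoding idea described in the abstract (predict all masked action tokens in parallel, commit the confident ones first, and remask uncertain ones for a later round) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `MASK`, `decode_chunk`, and `make_toy_predictor`, the linear keep schedule, and the `remask_below` threshold are all assumptions made for illustration, and a toy predictor stands in for the transformer.

```python
import math

MASK = -1  # hypothetical sentinel for a masked action token


def decode_chunk(predict, length, num_rounds=4, remask_below=0.1):
    """Iteratively unmask an action chunk, most confident positions first.

    `predict(tokens)` is assumed to return one probability list per position.
    """
    tokens = [MASK] * length
    for r in range(1, num_rounds + 1):
        probs = predict(tokens)  # all positions are scored in one parallel pass
        cand, conf = [], []
        for i, p in enumerate(probs):
            # Committed tokens are re-scored; masked ones take the argmax.
            t = tokens[i] if tokens[i] != MASK else max(range(len(p)), key=p.__getitem__)
            cand.append(t)
            conf.append(p[t])
        # Assumed linear schedule: keep a growing share of positions each round.
        target = math.ceil(length * r / num_rounds)
        order = sorted(range(length), key=lambda i: conf[i], reverse=True)
        keep = set(order[:target])
        last = r == num_rounds
        for i in range(length):
            if i in keep and (last or conf[i] >= remask_below):
                tokens[i] = cand[i]
            else:
                tokens[i] = MASK  # remask: revisit this position next round
    return tokens


def make_toy_predictor(target, vocab_size):
    """Toy stand-in for the policy: confidence grows as more tokens are known."""
    def predict(tokens):
        known = sum(t != MASK for t in tokens)
        out = []
        for i, goal in enumerate(target):
            peak = min(0.95, 0.3 + 0.1 * (i % 3) + 0.1 * known)
            rest = (1.0 - peak) / (vocab_size - 1)
            out.append([peak if v == goal else rest for v in range(vocab_size)])
        return out
    return predict


target = [3, 1, 4, 1, 5, 2, 6]
print(decode_chunk(make_toy_predictor(target, vocab_size=8), len(target)))
```

Because the toy predictor always places the most probability on the target token, the sketch recovers the target chunk in four rounds instead of seven sequential steps. With a real model, re-scoring committed tokens every round is what would let a low-confidence early choice be remasked and corrected.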
