MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Abstract

Vision-Language-Action (VLA) models aim to control robotic manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To address these limitations, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, which improves long-horizon consistency and grounds actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real-world tasks show state-of-the-art performance, with an average success rate of 98.0% on LIBERO and an average length of 4.78 on CALVIN.
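To make the masked-token formulation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes uniform binning to turn a continuous control value into a discrete token, and a confidence-ranked unmasking schedule for the iterative denoising loop; the abstract specifies neither. All names (`MASK`, `tokenize_action`, `detokenize_action`, `toy_predictor`, `denoise`) are hypothetical, and `toy_predictor` is a random stand-in for the trained backbone.

```python
import random

MASK = -1  # hypothetical id marking a position that is still masked


def tokenize_action(value, low=-1.0, high=1.0, bins=256):
    """Map one continuous control value to a bin index (uniform binning is an assumption)."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper bound falls into the last bin


def detokenize_action(token, low=-1.0, high=1.0, bins=256):
    """Return the centre of the bin; inverse of tokenize_action up to quantisation error."""
    return low + (token + 0.5) * (high - low) / bins


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for the backbone: one (token, confidence) proposal per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def denoise(length, vocab_size, steps, predictor, seed=0):
    """Start fully masked; at each step commit the most confident masked positions."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = predictor(tokens, vocab_size, rng)
        # Commit an even share of what is left, so the final step fills every remaining slot.
        remaining_steps = steps - step
        n_commit = max(1, -(-len(masked) // remaining_steps))  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    chunk = denoise(length=16, vocab_size=256, steps=4, predictor=toy_predictor)
    print([round(detokenize_action(t), 3) for t in chunk])
```

In the setting the abstract describes, the sequence being denoised would hold both goal-image tokens and action-chunk tokens, so positions in either modality can be committed in any order. The flat list here only illustrates the order-free, parallel refinement loop.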
