MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Abstract

Vision-Language-Action (VLA) models aim to control robotic manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism for capturing environment dynamics without extra modules. To address these limitations, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, which improves long-horizon consistency while grounding actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real-world tasks show state-of-the-art performance, with 98.0% average success on LIBERO and an average length of 4.78 on CALVIN.
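
The masked token denoising described in the abstract can be illustrated with a toy decoding loop. The sketch below is not the authors' implementation: `toy_predictor`, `iterative_unmask`, the `MASK` id, and the even-share commit schedule are all assumptions made for illustration, and a random predictor stands in for the trained backbone. It only shows the general pattern of starting from a fully masked target (goal-observation and action tokens) and committing the most confident predictions over a few parallel rounds, in no fixed left-to-right order.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def toy_predictor(context, target, vocab_size, rng):
    """Stand-in for the backbone (NOT the paper's model).

    Proposes a (token, confidence) pair for every masked target position.
    A real model would condition on `context`; this toy ignores it.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(target)
        if tok == MASK
    }


def iterative_unmask(context, target_len, vocab_size, steps, seed=0):
    """Fill `target_len` masked slots in at most `steps` parallel rounds."""
    rng = random.Random(seed)
    target = [MASK] * target_len
    for step in range(steps):
        proposals = toy_predictor(context, target, vocab_size, rng)
        if not proposals:
            break
        # Commit an even share of the remaining masked positions this round.
        k = math.ceil(len(proposals) / (steps - step))
        # Keep the most confident proposals, wherever they sit in the sequence.
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
        for pos, (tok, _conf) in best:
            target[pos] = tok
    return target


if __name__ == "__main__":
    context = [3, 1, 4, 1, 5]  # placeholder instruction + observation tokens
    print(iterative_unmask(context, target_len=10, vocab_size=16, steps=4))
```

Because every round scores all masked positions at once, a token near the end of the action chunk can be fixed before one near the start. That is the sense in which the refinement is order-free, in contrast to autoregressive decoding.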
