Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Abstract

Augmenting Vision-Language-Action models (VLAs) with world modeling has recently shown promise for improving robotic policy learning. However, jointly predicting next-state observations and action sequences remains challenging because of the inherent differences between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles this modality conflict and improves VLA performance across diverse tasks. Specifically, we design a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner without requiring a unified latent space. Building on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, in which action and vision tokens evolve asynchronously at different rates. In experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves gains of up to 6% over baseline methods, and our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
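To make the two mechanisms named in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a linear flow-matching path with t = 0 as pure noise and t = 1 as data, an additive loss with a hypothetical `vision_weight`, and an Euler sampler in which the two modalities take different numbers of steps. The names `decoupled_fm_loss`, `async_joint_sample`, and the `model(a, v, t_a, t_v)` signature are illustrative assumptions; the abstract does not specify any of them, nor which modality receives more steps.

```python
import random


def interpolate(noise, data, t):
    """Point on the linear path x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def decoupled_fm_loss(model, action, vision, rng, vision_weight=1.0):
    """Flow-matching loss with an independent noise level per modality.

    Assumption: `model` returns one velocity prediction per stream and is
    told both timesteps, so either stream may be noisier than the other.
    """
    t_a, t_v = rng.random(), rng.random()  # independent timesteps
    noise_a = [rng.gauss(0.0, 1.0) for _ in action]
    noise_v = [rng.gauss(0.0, 1.0) for _ in vision]
    a_t = interpolate(noise_a, action, t_a)
    v_t = interpolate(noise_v, vision, t_v)
    pred_a, pred_v = model(a_t, v_t, t_a, t_v)
    # The target velocity of the linear path is (data - noise) for each stream.
    loss_a = mse(pred_a, [d - n for n, d in zip(noise_a, action)])
    loss_v = mse(pred_v, [d - n for n, d in zip(noise_v, vision)])
    return loss_a + vision_weight * loss_v, loss_a, loss_v


def async_joint_sample(model, dim_a, dim_v, rng, action_steps=4, vision_steps=8):
    """Euler sampling in which each modality advances at its own rate.

    Assumption: the smaller step count must divide the larger one, so the
    coarser stream is updated on every k-th tick of the finer stream.
    """
    total = max(action_steps, vision_steps)
    if total % action_steps or total % vision_steps:
        raise ValueError("each step count must divide the larger one")
    a = [rng.gauss(0.0, 1.0) for _ in range(dim_a)]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim_v)]
    t_a = t_v = 0.0
    for tick in range(total):
        vel_a, vel_v = model(a, v, t_a, t_v)
        if tick % (total // action_steps) == 0:
            a = [x + u / action_steps for x, u in zip(a, vel_a)]
            t_a += 1.0 / action_steps
        if tick % (total // vision_steps) == 0:
            v = [x + u / vision_steps for x, u in zip(v, vel_v)]
            t_v += 1.0 / vision_steps
    return a, v
```

In this sketch, increasing `vision_steps` while holding `action_steps` fixed is one way to spend more test-time compute on a single stream. Any callable with the assumed signature can stand in for the network, for example `lambda a, v, t_a, t_v: ([0.0] * len(a), [0.0] * len(v))`.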
