Unified Vision-Language-Action Model

Abstract

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability to real-world ALOHA manipulation and autonomous driving.
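
The idea of modeling vision, language, and action as one discrete token sequence can be illustrated with a toy sketch. This is not UniVLA's tokenizer or training code: the vocabulary sizes, the offset scheme, and the function names `to_shared_ids` and `next_token_pairs` are hypothetical and chosen only for illustration. It assumes each modality has already been quantized into integer tokens and shows how they might be placed in a shared id space, interleaved per time step, and turned into next-token prediction pairs.

```python
from typing import List, Sequence, Tuple

# Hypothetical vocabulary sizes; none of these numbers come from the paper.
TEXT_VOCAB = 1000
VISION_VOCAB = 512
ACTION_VOCAB = 256

# Each modality gets its own id range inside one shared vocabulary.
VISION_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + VISION_VOCAB


def to_shared_ids(
    text_ids: Sequence[int],
    frames: Sequence[Sequence[int]],
    actions: Sequence[Sequence[int]],
) -> List[int]:
    """Place the instruction first, then interleave vision and action tokens per step."""
    if len(frames) != len(actions):
        raise ValueError("expected one action chunk per frame")
    seq = list(text_ids)
    for frame_tokens, action_tokens in zip(frames, actions):
        seq.extend(VISION_OFFSET + t for t in frame_tokens)
        seq.extend(ACTION_OFFSET + t for t in action_tokens)
    return seq


def next_token_pairs(seq: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Autoregressive targets: predict token i+1 from tokens 0..i."""
    return [(list(seq[: i + 1]), seq[i + 1]) for i in range(len(seq) - 1)]


if __name__ == "__main__":
    seq = to_shared_ids(
        text_ids=[1, 2],
        frames=[[0, 1], [2, 3]],
        actions=[[5], [6]],
    )
    print(seq)
    print(len(next_token_pairs(seq)))
```

Because every token lives in one id space, a single autoregressive model could in principle be trained on video-only data (vision tokens without actions) and on robot trajectories with the same next-token objective, which is the kind of flexibility the abstract describes.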
