TBD-VLA: Temporal Block Diffusion Vision-Language-Action Model

Abstract

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to speed up inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
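
The decoding scheme described above (masked discrete diffusion inside each temporal block, autoregression across blocks) can be illustrated with a minimal NumPy sketch. It is not the TBD-VLA implementation. The names `MASK`, `toy_denoiser`, `decode_block`, and `generate` are hypothetical, the denoiser is a random stand-in for a trained model, and the confidence-based unmasking schedule is an assumption borrowed from common masked-decoding practice rather than something the abstract specifies. The optional `known` argument shows one way temporal in-painting could be expressed: tokens that are already committed stay fixed and only the remaining positions are denoised.

```python
import numpy as np

MASK = -1  # hypothetical id marking a not-yet-decoded action token


def toy_denoiser(context, block, rng, vocab_size):
    """Random stand-in for a trained model (assumption).

    Returns logits of shape (len(block), vocab_size). A real model would
    condition on `context` (earlier blocks) and the partially filled `block`.
    """
    return rng.normal(size=(len(block), vocab_size))


def decode_block(context, block_len, vocab_size, steps, rng,
                 denoiser=toy_denoiser, known=None):
    """Fill one temporal block by iterative parallel unmasking (steps >= 1)."""
    block = np.full(block_len, MASK, dtype=np.int64)
    if known is not None:
        # In-painting: keep tokens that are already fixed, denoise the rest.
        fixed = known != MASK
        block[fixed] = known[fixed]
    for step in range(steps):
        masked = np.flatnonzero(block == MASK)
        if masked.size == 0:
            break
        logits = denoiser(context, block, rng, vocab_size)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)   # most likely token per position
        conf = probs.max(axis=-1)      # confidence of that prediction
        # Assumed schedule: spread the remaining masked positions evenly
        # over the remaining steps, committing the most confident first.
        n_unmask = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-conf[masked])[:n_unmask]]
        block[chosen] = pred[chosen]
    return block


def generate(num_blocks, block_len, vocab_size, steps, seed=0):
    """Autoregressive over blocks, parallel within each block."""
    rng = np.random.default_rng(seed)
    context = np.empty(0, dtype=np.int64)
    for _ in range(num_blocks):
        block = decode_block(context, block_len, vocab_size, steps, rng)
        context = np.concatenate([context, block])  # later blocks see this
    return context
```

In this sketch, `generate(3, 4, 16, 2)` produces 12 tokens using two denoiser calls per block instead of one call per token, which is the source of the speed-up that parallel decoding aims for. Passing `known=np.array([5, MASK, MASK, 7])` to `decode_block` leaves positions 0 and 3 untouched and fills only the middle two.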