TBD-VLA: Temporal Block Diffusion Vision-Language-Action Model

Abstract

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding for faster inference, but lack explicit mechanisms for modeling dependencies between tokens. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
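
To make the decoding pattern concrete, the following is a minimal sketch of block-wise generation: autoregressive across temporal blocks, with masked parallel unmasking inside each block. It is an illustration under stated assumptions, not the TBD-VLA implementation. The names `MASK`, `decode_block`, `generate`, and `toy_denoiser` are hypothetical, the confidence-ranked unmasking schedule is one common choice for masked discrete diffusion that the abstract does not confirm, the 256-bin action vocabulary is arbitrary, and the `frozen` argument is only one plausible way to express temporal in-painting of already committed actions.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

MASK = -1  # hypothetical sentinel for a masked action token

# A denoiser maps (context tokens, partially masked block) to one
# (token, confidence) guess per block position. In a real VLA this role would
# be played by the network conditioned on images and language; here it is a
# stand-in so the control flow can run on its own.
Denoiser = Callable[[Sequence[int], Sequence[int]], List[Tuple[int, float]]]


def decode_block(denoiser: Denoiser, context: Sequence[int], block_len: int,
                 steps: int, frozen: Optional[Dict[int, int]] = None) -> List[int]:
    """Unmask one block in at most `steps` parallel refinement rounds."""
    block = [MASK] * block_len
    # In-painting (assumed form): frozen positions hold committed tokens,
    # e.g. actions already being executed, and are never re-sampled.
    for pos, tok in (frozen or {}).items():
        block[pos] = tok
    for step in range(steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        guesses = denoiser(context, block)  # one call covers every position
        rounds_left = steps - step
        k = -(-len(masked) // rounds_left)  # ceiling division
        # Reveal the k most confident masked positions; the final round
        # reveals everything that is still masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:k]:
            block[i] = guesses[i][0]
    return block


def generate(denoiser: Denoiser, prefix: Sequence[int], num_blocks: int,
             block_len: int, steps: int,
             frozen: Optional[Dict[int, int]] = None) -> List[List[int]]:
    """Autoregressive over blocks, parallel within each block."""
    context = list(prefix)
    blocks = []
    for b in range(num_blocks):
        # Only the first block can overlap with actions already in flight.
        block = decode_block(denoiser, context, block_len, steps,
                             frozen if b == 0 else None)
        blocks.append(block)
        context.extend(block)  # later blocks condition on earlier ones
    return blocks


def toy_denoiser(context: Sequence[int],
                 block: Sequence[int]) -> List[Tuple[int, float]]:
    """Deterministic placeholder: random tokens from 256 bins with random confidence."""
    rng = random.Random(len(context))
    return [(rng.randrange(256), rng.random()) for _ in block]


if __name__ == "__main__":
    print(generate(toy_denoiser, prefix=[1, 2, 3], num_blocks=3,
                   block_len=4, steps=2))
    # Keep the first two actions of the first block fixed and fill in the rest.
    print(generate(toy_denoiser, prefix=[1, 2, 3], num_blocks=2,
                   block_len=4, steps=2, frozen={0: 7, 1: 9}))
```

In this sketch each block costs at most `steps` denoiser calls, compared with `block_len` calls for token-by-token autoregression, so choosing `steps` smaller than `block_len` is where the speedup would come from, while conditioning each block on the previous ones preserves the temporal ordering.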