TBD-VLA: Temporal Block Diffusion Vision-Language-Action Model

Abstract

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
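The decoding scheme described above (autoregressive across temporal blocks, masked discrete diffusion within each block, and in-painting of already-known tokens) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for the model, the confidence-based unmasking schedule is one common choice for masked discrete diffusion rather than the schedule TBD-VLA necessarily uses, and the function names and parameters are invented.

```python
import math
import random

MASK = -1  # sentinel for an action token that has not been decoded yet


def toy_denoiser(context, block, vocab_size, rng):
    """Hypothetical stand-in for the VLA backbone (assumption).

    Returns one (token, confidence) guess per position in `block`.
    A real model would condition on images, language, and `context`.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in block]


def decode_block(context, block_len, steps, vocab_size, rng, known=None):
    """Masked discrete diffusion within a single temporal block.

    `known` maps position -> token for entries fixed in advance
    (temporal in-painting); those entries are never re-sampled.
    """
    block = [MASK] * block_len
    for pos, tok in (known or {}).items():
        block[pos] = tok
    for step in range(steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        guesses = toy_denoiser(context, block, vocab_size, rng)
        # Unmask an equal share of the remaining positions per step,
        # so the final step always clears whatever is still masked.
        k = math.ceil(len(masked) / (steps - step))
        chosen = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in chosen:
            block[i] = guesses[i][0]
    return block


def generate_actions(num_blocks, block_len, steps, vocab_size=256,
                     seed=0, known_first_block=None):
    """Autoregressive over blocks: each block sees all earlier blocks."""
    rng = random.Random(seed)
    context = []
    for b in range(num_blocks):
        known = known_first_block if b == 0 else None
        context.extend(
            decode_block(context, block_len, steps, vocab_size, rng, known)
        )
    return context


# Three blocks of four tokens, two denoising steps per block; the first
# two tokens are treated as already committed and are in-painted around.
tokens = generate_actions(num_blocks=3, block_len=4, steps=2,
                          known_first_block={0: 7, 1: 9})
```

In this sketch a block of four tokens is decoded in two parallel steps instead of four sequential ones, which is where the latency saving over token-by-token autoregression would come from. The `known` argument mimics the asynchronous setting: tokens already committed for execution stay fixed while the remaining positions are filled in.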