
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

August 27, 2025
作者: Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
cs.AI

Abstract

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete diffusion action decoder supports precise action modeling and consistent training, laying the groundwork for scaling VLA to larger models and datasets.
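The decoding procedure described above (commit easy action tokens first, then re-open uncertain ones via secondary remasking) can be sketched as a mask-predict loop. The sketch below is a minimal illustration, not the paper's implementation: the backbone is replaced by a toy model emitting random logits, and the names (`decode_chunk`, `toy_model`, `remask_frac`, the `MASK` sentinel, the per-round commit budget) are hypothetical choices for illustration only.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a still-masked action token

def toy_model(tokens, vocab, rng):
    # Stand-in for the VLM backbone: random per-position logits.
    # In the actual method these would come from the transformer policy.
    return rng.standard_normal((len(tokens), vocab))

def decode_chunk(length=7, vocab=256, rounds=4, remask_frac=0.1, seed=0):
    """Schematic mask-predict decoding with secondary remasking."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK, dtype=int)
    conf = np.zeros(length)           # confidence of committed tokens
    per_round = max(1, length // rounds)
    for r in range(rounds):
        logits = toy_model(tokens, vocab, rng)
        # Softmax to get per-position predictive confidence.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        preds = probs.argmax(axis=-1)
        pred_conf = probs.max(axis=-1)
        masked = tokens == MASK
        if masked.any():
            # Adaptive order: commit the highest-confidence masked
            # positions first ("easy before hard"); all positions are
            # predicted in parallel, unlike autoregressive decoding.
            order = np.argsort(-(pred_conf * masked))
            commit = order[:per_round]
            commit = commit[masked[commit]]
            tokens[commit] = preds[commit]
            conf[commit] = pred_conf[commit]
        # Secondary remasking: re-open the least confident committed
        # tokens so later rounds can revise them (error correction).
        committed = np.where(tokens != MASK)[0]
        n_remask = int(remask_frac * len(committed))
        if r < rounds - 1 and n_remask > 0:
            worst = committed[np.argsort(conf[committed])[:n_remask]]
            tokens[worst] = MASK
    # Final pass: greedily fill any positions still masked.
    logits = toy_model(tokens, vocab, rng)
    preds = logits.argmax(axis=-1)
    still_masked = tokens == MASK
    tokens[still_masked] = preds[still_masked]
    return tokens
```

Because the whole chunk is refined in a fixed small number of rounds rather than token by token, the number of model evaluations stays constant in the chunk length, which is the source of the claimed speedup over the autoregressive bottleneck.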