
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

August 27, 2025
作者: Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
cs.AI

Abstract

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that the discrete diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.
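The refinement loop described above (parallel prediction over masked action tokens, committing confident positions first, then re-masking uncertain ones for later rounds) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model call, vocabulary size, chunk length, and remasking fraction are all assumed placeholders, with a random stand-in for the transformer's logits.

```python
import math
import random

random.seed(0)

MASK = -1      # sentinel id for the [MASK] token (assumed)
VOCAB = 256    # number of discretized action bins (assumed)
CHUNK = 8      # action tokens per chunk (assumed)

def dummy_logits(tokens):
    """Stand-in for the single-transformer policy: random logits per
    position. In the real policy these would come from the VLM backbone."""
    return [[random.gauss(0, 1) for _ in range(VOCAB)] for _ in tokens]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def decode_chunk(steps=4, remask_frac=0.25):
    """Iteratively unmask an action chunk over a few refinement rounds."""
    tokens = [MASK] * CHUNK
    for step in range(steps):
        probs = [softmax(row) for row in dummy_logits(tokens)]
        conf = [max(p) for p in probs]           # per-position confidence
        pred = [p.index(max(p)) for p in probs]  # greedy prediction
        # commit predictions at every still-masked position (parallel decode)
        tokens = [pred[i] if t == MASK else t for i, t in enumerate(tokens)]
        if step < steps - 1:
            # secondary remasking: re-mask the least confident positions so
            # later rounds can revisit and correct them; easy (high-confidence)
            # elements stay resolved, harder ones get more rounds
            k = max(1, int(remask_frac * CHUNK))
            for i in sorted(range(CHUNK), key=lambda i: conf[i])[:k]:
                tokens[i] = MASK
    return tokens
```

Because every masked position is predicted in one forward pass per round, the number of function evaluations is the number of refinement rounds rather than the chunk length, which is the source of the speedup over left-to-right autoregressive decoding.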