

VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference

November 30, 2025
Authors: Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, Song Han
cs.AI

Abstract

Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash
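The core idea in the abstract can be sketched in code: while the robot executes a previously predicted action chunk, the next inference call is conditioned not on the current (soon-to-be-stale) state, but on the state estimated by rolling forward through the actions still pending. The sketch below is a minimal toy illustration based only on the abstract; all names (`roll_forward`, `async_control_loop`, the scalar dynamics, the chunk and latency sizes) are assumptions, not the VLASH implementation.

```python
# Toy sketch of future-state-aware asynchronous inference, as described in
# the abstract. All names and dynamics here are illustrative assumptions.
from collections import deque

INFER_STEPS = 3  # control steps one inference call is assumed to take


def roll_forward(state, actions):
    """Estimate a future robot state by applying pending actions.

    Toy dynamics: state is a scalar joint position, each action a delta.
    A real system would use the robot's actual kinematics/dynamics.
    """
    for a in actions:
        state = state + a
    return state


def async_control_loop(policy, state, horizon=32):
    """Execute actions while the next chunk is inferred from a
    future-state estimate, bridging prediction and execution intervals."""
    pending = deque(policy(state))  # initial chunk, predicted synchronously
    trajectory = []
    next_chunk = None
    for _ in range(horizon):
        # Launch inference early enough that it finishes exactly when the
        # current chunk runs out (INFER_STEPS actions remain).
        if next_chunk is None and len(pending) == INFER_STEPS:
            # Key idea: condition on the *estimated* state at the moment
            # the new chunk will start executing, not on the stale
            # current state.
            future_state = roll_forward(state, list(pending))
            next_chunk = policy(future_state)  # modeled as instantaneous here
        if not pending:  # current chunk exhausted: swap in the new one
            pending = deque(next_chunk)
            next_chunk = None
        action = pending.popleft()
        state = roll_forward(state, [action])  # execute one action
        trajectory.append(state)
    return trajectory
```

In this toy model the handoff is seamless because the new chunk was predicted from the very state at which it begins executing; with a naive asynchronous scheme (conditioning on the state at inference launch time), the new chunk would be misaligned by `INFER_STEPS` actions, which is the instability the paper targets.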
December 3, 2025