FASTER: Rethinking Real-Time Flow VLAs

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness but neglect the latency with which a policy reacts to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient: it forces the system to complete all sampling steps before any movement can start, which forms the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction tenfold (e.g., in π_{0.5} and X-VLA) into a single step while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, demonstrate that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
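To make the uniform-distribution claim concrete, the following is a minimal Monte Carlo sketch, not the paper's analysis. It assumes a simplified model in which an environmental change occurs at a uniformly random moment within one execution cycle, the policy only observes it at the next inference request, and the first responsive action appears one TTFA later. All numbers (per-step latency, fixed overhead, control period, horizon length) and all function names are hypothetical values chosen for illustration.

```python
import random


def sample_reaction_times(ttfa_s, horizon_steps, control_dt_s,
                          n_samples=100_000, seed=0):
    """Draw reaction times under the assumed model.

    Assumption (not taken from the paper): the wait until the next
    observation is uniform on [0, horizon_steps * control_dt_s], and
    the first responsive action arrives ttfa_s after that observation.
    """
    rng = random.Random(seed)
    cycle_s = horizon_steps * control_dt_s
    return [rng.uniform(0.0, cycle_s) + ttfa_s for _ in range(n_samples)]


def summarize(samples):
    """Return (min, mean, max) of a list of reaction times in seconds."""
    return min(samples), sum(samples) / len(samples), max(samples)


# Hypothetical latency budget: a fixed overhead plus one cost per sampling step.
OVERHEAD_S = 0.08      # assumed time before flow sampling starts
STEP_S = 0.02          # assumed cost of one flow sampling step
HORIZON_STEPS = 10     # assumed number of actions executed per chunk
CONTROL_DT_S = 0.02    # assumed control period (50 Hz)

# Constant schedule: the first action is ready only after all 10 steps.
ttfa_constant = OVERHEAD_S + 10 * STEP_S
# Single-step immediate reaction: the first action is ready after 1 step.
ttfa_single = OVERHEAD_S + 1 * STEP_S

constant_stats = summarize(
    sample_reaction_times(ttfa_constant, HORIZON_STEPS, CONTROL_DT_S))
single_stats = summarize(
    sample_reaction_times(ttfa_single, HORIZON_STEPS, CONTROL_DT_S))

print("constant schedule (min, mean, max):", constant_stats)
print("single-step TTFA  (min, mean, max):", single_stats)
```

Under these assumptions, reducing TTFA shifts the whole distribution earlier, while the width of the distribution is set by the execution horizon, which is why the abstract treats the two quantities as jointly determining reaction time.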