Learning Native Continuation for Action Chunking Flow Policies
February 13, 2026
Authors: Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao
cs.AI
Abstract
Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses a randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion times. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
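The core idea of initializing denoising from a schedule-shaped mixture of known actions and noise can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the schedule function, its shape parameter, and the chunk dimensions are all hypothetical choices made here for clarity.

```python
import numpy as np

def mixing_schedule(tau, s=1.0):
    # Hypothetical monotone schedule: fraction of the known action kept
    # at denoising time tau in [0, 1] (tau = 1 means pure noise).
    # The actual schedule shaping used by Legato is not specified here.
    return (1.0 - tau) ** s

def continuation_init(known_actions, tau0, rng):
    """Start denoising from a schedule-shaped mixture of the previously
    committed action chunk and Gaussian noise, rather than from pure
    noise as in naive chunked execution."""
    noise = rng.standard_normal(known_actions.shape)
    m = mixing_schedule(tau0)
    return m * known_actions + (1.0 - m) * noise

rng = np.random.default_rng(0)
prev_chunk = np.zeros((8, 7))  # e.g. 8 timesteps of 7-DoF actions (made up)
x0 = continuation_init(prev_chunk, tau0=0.3, rng=rng)
print(x0.shape)  # (8, 7)
```

At `tau0 = 0` the initialization reproduces the known chunk exactly, and at `tau0 = 1` it degenerates to the standard pure-noise start; randomizing `tau0` during training is what lets the policy adapt to varying inference delays.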