Learning High-Frequency Continuous Action Chunks in Latent Space

Abstract

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is increased further (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space learned with a variational autoencoder (VAE). This formulation significantly improves both the temporal and the spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and less jerky motion. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes the tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.
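
To make the idea of learning over action chunks in a latent space more concrete, the following is a minimal sketch of a VAE-style round trip for a single chunk. It is not the architecture used in this work, which the abstract does not specify. The chunk horizon (60 steps, i.e. one second at 60 Hz), the 7-dimensional action, the 8-dimensional latent, the linear encoder and decoder, and all function names are assumptions chosen purely for illustration, and the weights are random rather than trained.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
HORIZON = 60      # actions per chunk (one second at 60 Hz)
ACTION_DIM = 7    # e.g. a 7-DoF arm
LATENT_DIM = 8    # size of the latent code for one chunk


def encode(chunk, w_mu, w_logvar):
    """Map a flattened action chunk to the mean and log-variance of q(z | chunk)."""
    flat = chunk.reshape(-1)
    return w_mu @ flat, w_logvar @ flat


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z, w_dec):
    """Map a latent code back to a full (HORIZON, ACTION_DIM) action chunk."""
    return (w_dec @ z).reshape(HORIZON, ACTION_DIM)


def kl_to_standard_normal(mu, logvar):
    """KL divergence between N(mu, diag(exp(logvar))) and N(0, I)."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)


rng = np.random.default_rng(0)

# Untrained linear encoder/decoder weights (hypothetical stand-ins for networks).
w_mu = 0.01 * rng.standard_normal((LATENT_DIM, HORIZON * ACTION_DIM))
w_logvar = 0.01 * rng.standard_normal((LATENT_DIM, HORIZON * ACTION_DIM))
w_dec = 0.01 * rng.standard_normal((HORIZON * ACTION_DIM, LATENT_DIM))

# A smooth synthetic chunk: the same sinusoid on every action dimension.
t = np.arange(HORIZON) / 60.0
chunk = np.sin(2.0 * np.pi * t)[:, None] * np.ones((1, ACTION_DIM))

mu, logvar = encode(chunk, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)
recon = decode(z, w_dec)

# Standard VAE objective terms: reconstruction error plus KL regularizer.
recon_loss = np.mean((recon - chunk) ** 2)
kl = kl_to_standard_normal(mu, logvar)
```

In a setup like this, a policy would predict the low-dimensional code `z` rather than all 420 action values of the chunk directly, and the decoder would expand it into the full 60-step trajectory. Whether the method described here follows this exact split is an assumption of the sketch.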