Learning High-Frequency Continuous Action Chunks in Latent Space

Abstract

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is increased further (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space learned with a variational autoencoder (VAE). This formulation significantly improves both the temporal and the spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes the tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.