Learning High-Frequency Continuous Action Chunks in Latent Space

Abstract

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is increased further (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space learned with a variational autoencoder (VAE). This formulation significantly improves both the temporal and the spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and less jerky motion. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes the tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.
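
To make the continuity problem concrete, the following minimal sketch counts how many control ticks elapse during one inference call and measures the jump at a chunk boundary. It is an illustration of the problem setting only, not the Reuse-then-Refine procedure, which the abstract does not specify. The function names, the 120 ms latency, and the toy one-dimensional chunks are all hypothetical.

```python
def stale_steps(latency_ms: int, control_hz: int) -> int:
    """Control ticks that elapse while one inference call runs (rounded up).

    Assumption: latency is a whole number of milliseconds, so integer
    ceiling division avoids floating-point rounding.
    """
    return -(-latency_ms * control_hz // 1000)


def max_step_change(actions):
    """Largest per-dimension change between consecutive actions.

    A crude proxy for temporal smoothness within one chunk.
    """
    return max(
        max(abs(a - b) for a, b in zip(cur, nxt))
        for cur, nxt in zip(actions, actions[1:])
    )


def boundary_jump(prev_chunk, next_chunk, switch_index):
    """Gap between the old chunk's action at the switch point and the
    first action of the newly arrived chunk (largest over dimensions)."""
    old = prev_chunk[switch_index]
    new = next_chunk[0]
    return max(abs(a - b) for a, b in zip(old, new))


# Hypothetical numbers: the same 120 ms inference latency costs far more
# ticks at 60 Hz than at 10 Hz, so more of the old chunk must be reused.
ticks_at_60hz = stale_steps(120, 60)
ticks_at_10hz = stale_steps(120, 10)

# Toy one-dimensional chunks: smooth inside, discontinuous at the switch.
prev_chunk = [[0.0], [0.25], [0.5], [0.75]]
next_chunk = [[1.0], [1.25]]
inner = max_step_change(prev_chunk)
jump = boundary_jump(prev_chunk, next_chunk, switch_index=2)
```

In this toy case the jump at the boundary is twice the largest step inside the chunk, which is the kind of discontinuity that shows up on a robot as a pause or a jerk.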