Learning High-Frequency Continuous Action Chunks in Latent Space

Abstract

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is increased further (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space learned with a variational autoencoder (VAE). This formulation significantly improves both the temporal and the spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes the tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.