PoLAR: Disentangling Extent and Mode in Latent Actions for Robot Policy Learning

Abstract

Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation, which entangles transition extent with transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-directional structure on latent actions, encouraging the radius to encode transition extent and the direction to retain transition mode. PoLAR uses the temporal offset between two observations as a weak proxy for transition extent, encouraging latent actions from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose volume expands with radius and therefore naturally accommodates the more diverse transition modes that arise at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in both simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained vision-language-action models (VLAs). These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.
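The radial-directional structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the hyperbolic model or the training objective, so the Poincaré ball model, the curvature parameter `c`, the pairwise hinge loss, the `margin` value, and all function names below are assumptions chosen for illustration only.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v into the Poincare ball of curvature -c
    via the exponential map at the origin (assumed model, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def polar_decompose(z, c=1.0):
    """Split a ball point into (hyperbolic distance to origin, unit direction)."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        return 0.0, [0.0 for _ in z]
    # Clamp so atanh stays finite for points numerically on the boundary.
    scaled = min(math.sqrt(c) * norm, 1.0 - 1e-12)
    radius = 2.0 / math.sqrt(c) * math.atanh(scaled)
    direction = [x / norm for x in z]
    return radius, direction

def radius_ranking_loss(radii, gaps, margin=0.1):
    """Hypothetical pairwise hinge loss: whenever gaps[i] > gaps[j],
    penalize radii[i] falling short of radii[j] + margin."""
    terms = [
        max(0.0, margin - (radii[i] - radii[j]))
        for i in range(len(gaps))
        for j in range(len(gaps))
        if gaps[i] > gaps[j]
    ]
    return sum(terms) / len(terms) if terms else 0.0

if __name__ == "__main__":
    # Two made-up latent vectors paired with temporal gaps of 1 and 4 frames.
    latents = [[0.3, 0.4], [0.9, 1.2]]
    gaps = [1, 4]
    radii = [polar_decompose(exp_map_origin(v))[0] for v in latents]
    print(radii, radius_ranking_loss(radii, gaps))
```

In this sketch the radius acts only as an ordering signal: the loss asks that a larger temporal gap yield a larger radius, without fixing the radius to any particular value, which matches the abstract's description of temporal offset as a weak proxy. The direction is left unconstrained by the loss so that it remains free to carry transition mode.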