PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Abstract

Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent with transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radius-and-direction structure on latent actions, encouraging the radius to encode transition extent and the direction to retain transition mode. PoLAR uses the temporal offset between two observations as a weak proxy for transition extent, encouraging latent actions from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose volume grows with radius and therefore offers a natural fit for representing more diverse transition modes at larger extents. In both in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained vision-language-action models (VLAs). These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.
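
To make the radius and direction split concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Poincaré ball model of hyperbolic space with curvature -c, and it swaps in a simple pairwise hinge loss as one plausible way to encourage larger temporal gaps to map to larger radii; the abstract does not specify the actual objective. All function names and the margin parameter are hypothetical.

```python
import math
from typing import List, Sequence


def _norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v: Sequence[float], c: float = 1.0) -> List[float]:
    """Map a Euclidean vector onto the Poincare ball (curvature -c) via the
    exponential map at the origin. Assumed model, not taken from the paper."""
    n = _norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def hyperbolic_radius(z: Sequence[float], c: float = 1.0) -> float:
    """Geodesic distance from the origin; plays the role of transition extent."""
    # Clamp just inside the ball boundary so atanh stays finite.
    n = min(_norm(z), (1.0 - 1e-9) / math.sqrt(c))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * n)


def direction(z: Sequence[float]) -> List[float]:
    """Unit vector of the latent action; plays the role of transition mode."""
    n = _norm(z)
    return [x / n for x in z] if n > 0.0 else [0.0] * len(z)


def radial_ranking_loss(radii: Sequence[float], gaps: Sequence[int],
                        margin: float = 0.1) -> float:
    """Hypothetical hinge loss: whenever pair i has a larger temporal gap than
    pair j, penalize radius_i falling below radius_j + margin."""
    total, count = 0.0, 0
    for i in range(len(radii)):
        for j in range(len(radii)):
            if gaps[i] > gaps[j]:
                total += max(0.0, margin - (radii[i] - radii[j]))
                count += 1
    return total / count if count else 0.0
```

In this sketch the temporal gap only ranks radii and does not fix their values, which matches its description as a weak proxy: the loss is zero whenever radii are already ordered consistently with the gaps by at least the margin.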