PoLAR: Disentangling Extent and Mode of Latent Actions for Robot Policy Learning

Abstract

Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent with transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-directional structure on latent actions, encouraging the radius to encode transition extent and the direction to encode transition mode. PoLAR uses the temporal offset between two observations as a weak proxy for transition extent, encouraging latent actions from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose volume expands with radius and is therefore a natural fit for accommodating more diverse transition modes at larger extents. In both in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained vision-language-action (VLA) models. These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.
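The radial-directional idea can be illustrated with a small sketch. The abstract specifies neither the hyperbolic model nor the loss, so everything below is an assumption made for illustration: the Poincaré ball with curvature -1 stands in for hyperbolic space, a pairwise hinge loss stands in for the objective that encourages larger radii at larger temporal offsets, and the function names (`exp_map_origin`, `polar_decompose`, `radius_ranking_loss`) and the `margin` parameter are hypothetical rather than taken from PoLAR.

```python
import numpy as np

def exp_map_origin(v, eps=1e-9):
    """Map Euclidean (tangent) vectors at the origin into the unit Poincare ball.

    Assumes curvature -1: exp_0(v) = tanh(||v||) * v / ||v||.
    """
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.tanh(norm) * v / np.maximum(norm, eps)

def polar_decompose(x, eps=1e-9):
    """Split points in the ball into (hyperbolic radius, unit direction).

    The radius is the hyperbolic distance to the origin, 2 * artanh(||x||).
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    clipped = np.clip(norm, 0.0, 1.0 - 1e-7)  # keep artanh finite near the boundary
    radius = 2.0 * np.arctanh(clipped)[..., 0]
    direction = x / np.maximum(norm, eps)
    return radius, direction

def radius_ranking_loss(radius, offsets, margin=0.1):
    """Hypothetical hinge loss over all pairs (i, j) in a batch.

    Whenever offsets[i] > offsets[j], radius[i] should exceed radius[j]
    by at least `margin`; violations are penalized linearly.
    """
    dr = radius[:, None] - radius[None, :]    # r_i - r_j
    dt = offsets[:, None] - offsets[None, :]  # t_i - t_j
    mask = dt > 0
    if not mask.any():
        return 0.0
    hinge = np.maximum(0.0, margin - dr)
    return float(hinge[mask].mean())

# Two latent actions that share a direction but differ in Euclidean norm.
latents = exp_map_origin(np.array([[0.3, 0.4], [0.6, 0.8]]))
radius, direction = polar_decompose(latents)
loss_ordered = radius_ranking_loss(radius, np.array([1.0, 4.0]))
loss_reversed = radius_ranking_loss(radius, np.array([4.0, 1.0]))
```

In this toy case both latents point the same way, so only their radii differ. The loss is zero when the pair with the larger temporal offset already has the larger radius and positive when the ordering is reversed. Because the penalty depends only on the radius, the direction is left free, which is the kind of separation the abstract describes.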