H^{3}DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

Abstract

Visuomotor policy learning has made substantial progress in robotic manipulation, with recent approaches relying predominantly on generative models to capture the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H^{3}DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H^{3}DP contains three levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the coarse-to-fine generation of actions with the corresponding visual features. Extensive experiments demonstrate that H^{3}DP yields a +27.5% average relative improvement over baselines across 44 simulation tasks and achieves superior performance on 4 challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.
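To make the first level of the hierarchy more concrete, the following is a minimal sketch of one way depth-aware input layering could work. It is not taken from the paper: the function name `layer_rgbd_by_depth`, the uniform depth bins, and the default of three layers are all assumptions made for illustration, and the actual H^{3}DP layering scheme may differ.

```python
import numpy as np


def layer_rgbd_by_depth(rgb, depth, num_layers=3):
    """Split an RGB-D frame into depth-ordered RGB layers.

    Hypothetical helper for illustration only; uniform depth bins are an
    assumption, not the binning rule used by H^{3}DP.
    rgb:   (H, W, 3) array
    depth: (H, W) array of per-pixel depth values
    """
    # Assumed binning: equally spaced edges between the min and max depth.
    edges = np.linspace(depth.min(), depth.max(), num_layers + 1)
    # Using only the interior edges yields bin indices in 0..num_layers-1.
    bin_idx = np.digitize(depth, edges[1:-1])

    layers = np.zeros((num_layers,) + rgb.shape, dtype=rgb.dtype)
    for k in range(num_layers):
        mask = bin_idx == k
        # Copy the pixels of this depth bin; all other pixels stay zero.
        layers[k][mask] = rgb[mask]
    return layers, bin_idx


# Usage: a 2x2 frame whose four pixels fall into near, middle, and far bins.
rgb = np.arange(1, 13, dtype=np.uint8).reshape(2, 2, 3)
depth = np.array([[0.0, 0.7], [1.2, 3.0]])
layers, bin_idx = layer_rgbd_by_depth(rgb, depth, num_layers=3)
```

Each pixel lands in exactly one layer, so summing the layers recovers the original image. A downstream encoder could then process the layers separately, which is the general idea behind organizing RGB-D observations by depth.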