H^{3}DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
May 12, 2025
作者: Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu
cs.AI
Abstract
Visuomotor policy learning has witnessed substantial progress in robotic
manipulation, with recent approaches predominantly relying on generative models
to model the action distribution. However, these methods often overlook the
critical coupling between visual perception and action prediction. In this
work, we introduce Triply-Hierarchical Diffusion Policy (H^{3}DP), a novel
visuomotor learning framework
that explicitly incorporates hierarchical structures to strengthen the
integration between visual features and action generation. H^{3}DP contains
3 levels of hierarchy: (1) depth-aware input layering that organizes
RGB-D observations based on depth information; (2) multi-scale visual
representations that encode semantic features at varying levels of granularity;
and (3) a hierarchically conditioned diffusion process that aligns the
generation of coarse-to-fine actions with corresponding visual features.
Extensive experiments demonstrate that H^{3}DP yields a +27.5%
average relative improvement over baselines across 44 simulation
tasks and achieves superior performance in 4 challenging bimanual
real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.
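The abstract names the three hierarchy levels without implementation detail. As a rough illustration of the first level, depth-aware input layering, below is a minimal sketch that splits an RGB-D observation into depth-binned RGB layers; the layer count, the uniform binning scheme, and the function name depth_aware_layering are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def depth_aware_layering(rgb, depth, num_layers=3, d_min=0.1, d_max=1.5):
    """Illustrative sketch: split an RGB-D observation into depth layers.

    rgb:   (H, W, 3) array; depth: (H, W) array of metric depth values.
    Returns (num_layers, H, W, 3), where layer k keeps only the pixels whose
    depth falls in the k-th bin and zeros out everything else.
    Uniform bins and hard masks are assumptions; the paper may organize
    its layers differently.
    """
    edges = np.linspace(d_min, d_max, num_layers + 1)
    layers = np.zeros((num_layers,) + rgb.shape, dtype=rgb.dtype)
    for k in range(num_layers):
        lo, hi = edges[k], edges[k + 1]
        # Make the last bin right-inclusive so the farthest pixels are kept.
        mask = (depth >= lo) & ((depth < hi) if k < num_layers - 1 else (depth <= hi))
        layers[k][mask] = rgb[mask]
    return layers

# Example: a random 64x64 RGB-D frame split into 3 depth layers.
rgb = np.random.rand(64, 64, 3).astype(np.float32)
depth = np.random.uniform(0.1, 1.5, size=(64, 64)).astype(np.float32)
layers = depth_aware_layering(rgb, depth)
print(layers.shape)  # (3, 64, 64, 3)
```

Each such layer could then be encoded into multi-scale visual features and paired with the corresponding stage of the coarse-to-fine, hierarchically conditioned diffusion process described in the abstract.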