H^{3}DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
May 12, 2025
Authors: Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu
cs.AI
Abstract
Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H^{3}DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H^{3}DP contains three levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H^{3}DP yields a +27.5% average relative improvement over baselines across 44 simulation tasks and achieves superior performance in 4 challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/
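The abstract's first level of hierarchy, depth-aware input layering, partitions an RGB-D observation into depth-ordered layers before encoding. The paper's exact layering scheme is not specified here; the following is a minimal sketch under the assumption that pixels are split into equal-population depth bins (near to far), with the function name `depth_layering` and the bin count chosen for illustration:

```python
import numpy as np

def depth_layering(rgb, depth, num_layers=3):
    """Split an RGB-D observation into depth-ordered RGB layers.

    rgb:   (H, W, 3) array of color values.
    depth: (H, W) array of per-pixel depths.

    Each output layer keeps only the pixels whose depth falls in one
    of `num_layers` equal-population depth bins; all other pixels are
    zeroed out. The layers together partition the input image.
    """
    # Equal-population bin edges over the observed depth range.
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1))
    layers = []
    for k in range(num_layers):
        lo, hi = edges[k], edges[k + 1]
        if k < num_layers - 1:
            mask = (depth >= lo) & (depth < hi)
        else:
            # Include the far edge in the last bin.
            mask = (depth >= lo) & (depth <= hi)
        layers.append(rgb * mask[..., None])
    return layers
```

Each layer can then be fed to its own visual encoder branch, which is one natural way to realize the coarse-to-fine conditioning the abstract describes.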