Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds

March 17, 2026
Authors: Daniel Sungho Jung, Dohee Cho, Kyoung Mu Lee
cs.AI

Abstract

Understanding humans from LiDAR point clouds is one of the most critical tasks in autonomous driving due to its close relationship with pedestrian safety, yet it remains challenging in the presence of diverse human-object interactions and cluttered backgrounds. Existing methods largely overlook the potential of leveraging human-object interactions to build robust 3D human pose estimation frameworks. Two major challenges motivate the incorporation of human-object interaction. First, human-object interactions introduce spatial ambiguity between human and object points, which often leads to erroneous 3D human keypoint predictions in interaction regions. Second, there is severe class imbalance in the number of points between interacting and non-interacting body parts, with interaction-frequent regions such as the hands and feet being sparsely observed in LiDAR data. To address these challenges, we propose a Human-Object Interaction Learning (HOIL) framework for robust 3D human pose estimation from LiDAR point clouds. To mitigate the spatial ambiguity, we present human-object interaction-aware contrastive learning (HOICL), which effectively enhances feature discrimination between human and object points, particularly in interaction regions. To alleviate the class imbalance, we introduce contact-aware part-guided pooling (CPPool), which adaptively reallocates representational capacity by compressing overrepresented points while preserving informative points from interacting body parts. In addition, we present an optional contact-based temporal refinement that corrects erroneous per-frame keypoint estimates using contact cues over time. As a result, HOIL effectively leverages human-object interaction to resolve spatial ambiguity and class imbalance in interaction regions. Code will be released.
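The abstract does not give the HOICL loss, but the idea of contrasting human point features against object point features in interaction regions can be sketched with a generic InfoNCE-style objective. Everything below (the function name, the use of other human points as positives and object points as negatives, the temperature value) is a hypothetical illustration, not the paper's actual formulation:

```python
import numpy as np

def contrastive_separation_loss(human_feats, object_feats, temperature=0.1):
    """InfoNCE-style sketch: pull human point features together and push
    them away from object point features. Hypothetical, not the exact
    HOICL objective from the paper.

    human_feats:  (N, D) array of per-point features labeled human
    object_feats: (M, D) array of per-point features labeled object
    """
    # L2-normalize so dot products are cosine similarities.
    h = human_feats / np.linalg.norm(human_feats, axis=1, keepdims=True)
    o = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)

    losses = []
    for i in range(len(h)):
        # Positives: the other human points; negatives: the object points.
        pos = np.delete(h, i, axis=0) @ h[i]
        neg = o @ h[i]
        logits = np.concatenate([pos, neg]) / temperature
        logits -= logits.max()  # subtract max for numerical stability
        exp = np.exp(logits)
        # -log of the probability mass assigned to the positive set.
        losses.append(-np.log(exp[: len(pos)].sum() / exp.sum()))
    return float(np.mean(losses))
```

When human and object features are well separated the loss is near zero, and it grows as the two point sets become indistinguishable, which is the behavior the abstract attributes to enhanced feature discrimination in interaction regions.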