MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment
February 22, 2026
Authors: Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai
cs.AI
Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRI, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
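The core contrastive objective the abstract describes — pulling each IMU segment toward its matching body-part pose segment and pushing it away from mismatched pairs in the batch — is the standard symmetric InfoNCE formulation. Below is a minimal NumPy sketch of that loss; the function name `info_nce`, the batch shapes, and the temperature value are illustrative assumptions, not taken from the MoBind codebase.

```python
import numpy as np

def info_nce(imu_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    imu_emb, pose_emb: (B, D) arrays; row i of each is a matched
    IMU-segment / pose-segment pair (a hypothetical setup, sketching
    the contrastive alignment described in the abstract).
    """
    # L2-normalize so the dot product is cosine similarity
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    pose = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = imu @ pose.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))          # positives sit on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal positive
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average IMU->pose and pose->IMU retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the hierarchical scheme the paper outlines, a loss of this form would be applied at two granularities: once over token-level temporal segments per body part, and once over aggregated whole-body embeddings.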