MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment
February 22, 2026
Authors: Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai
cs.AI
Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRI, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
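The core contrastive objective the abstract describes — pulling each IMU segment toward its matching body-part pose segment and pushing it away from mismatched pairs in the batch — is the standard symmetric InfoNCE formulation. Below is a minimal NumPy sketch of that loss; the function name `info_nce`, the batch shapes, and the temperature value are illustrative assumptions, not taken from the MoBind codebase.

```python
import numpy as np

def info_nce(imu_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    imu_emb, pose_emb: (B, D) arrays; row i of each is a matched
    IMU-segment / pose-segment pair (a hypothetical setup, sketching
    the contrastive alignment described in the abstract).
    """
    # L2-normalize so the dot product is cosine similarity
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    pose = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = imu @ pose.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))          # positives sit on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal positive
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average IMU->pose and pose->IMU retrieval directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

In the hierarchical scheme the paper outlines, a loss of this form would be applied at two granularities: once over token-level temporal segments per body part, and once over aggregated whole-body embeddings.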