

MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

February 22, 2026
作者: Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai
cs.AI

Abstract

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on the mRi, TotalCapture, and EgoHumans datasets, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
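The hierarchical contrastive strategy described above can be illustrated with a symmetric InfoNCE objective applied at two levels: once over token-level temporal segments and once over globally pooled motion embeddings. The sketch below is a minimal illustration under assumed conventions; the function names, mean-pooling choice, and loss weights are our assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings a, b of shape (N, D).

    Row i of `a` and row i of `b` form a positive pair; all other rows
    in the batch serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])        # positives sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # symmetrize: IMU-to-pose and pose-to-IMU retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def hierarchical_loss(imu_tokens, pose_tokens, w_local=1.0, w_global=1.0):
    """Token-level alignment fused with globally pooled alignment.

    imu_tokens, pose_tokens: (B, T, D) per-segment embeddings for B
    sequences of T temporal tokens each.
    """
    B, T, D = imu_tokens.shape
    # local term: align each temporal segment with its counterpart
    local = info_nce(imu_tokens.reshape(B * T, D),
                     pose_tokens.reshape(B * T, D))
    # global term: align mean-pooled whole-sequence motion embeddings
    glob = info_nce(imu_tokens.mean(axis=1), pose_tokens.mean(axis=1))
    return w_local * local + w_global * glob
```

As a sanity check, correctly paired IMU/pose embeddings should yield a lower loss than a batch whose pose embeddings have been shuffled across sequences, since the shuffle destroys the diagonal positives.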