MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment

Abstract

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities. Code is available at https://github.com/bbvisual/MoBind.
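
As a rough illustration of how a hierarchical contrastive objective of this kind could be assembled, the sketch below combines a token-level, a local (per body part), and a global (whole body) InfoNCE term. It is not the authors' implementation: the function names, the nested-list layout `[sample][part][token][dim]`, the mean pooling used for aggregation, the temperature of 0.1, and the equal loss weights are all assumptions made for this example, and the embeddings are assumed to come from IMU and pose encoders that are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE: a[i] and b[i] are positives, all other pairs negatives."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine similarity between every a[i] and b[j], scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(z - m) for z in row))
            total += lse - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed aggregation choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_loss(imu, pose, w_token=1.0, w_local=1.0, w_global=1.0):
    """imu and pose are nested lists indexed as [sample][part][token][dim]."""
    num_samples, num_parts = len(imu), len(imu[0])

    # Token level: within each (sample, part), align temporal segments.
    token_terms = [info_nce(imu[s][p], pose[s][p])
                   for s in range(num_samples) for p in range(num_parts)]
    token = sum(token_terms) / len(token_terms)

    # Local level: pool tokens per body part, contrast across samples.
    local_terms = [info_nce([mean_pool(s[p]) for s in imu],
                            [mean_pool(s[p]) for s in pose])
                   for p in range(num_parts)]
    local = sum(local_terms) / num_parts

    # Global level: pool over tokens and parts, contrast across samples.
    g_imu = [mean_pool([mean_pool(part) for part in s]) for s in imu]
    g_pose = [mean_pool([mean_pool(part) for part in s]) for s in pose]
    glob = info_nce(g_imu, g_pose)

    return w_token * token + w_local * local + w_global * glob
```

In this sketch the token-level term only sees segments from the same sample and body part, so it rewards sub-sequence temporal ordering, while the local and global terms contrast pooled embeddings across samples and therefore reward coarser semantic agreement.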
