Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

Abstract

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings. This discards fine-grained local correspondences, which reduces accuracy, and it limits the interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and add Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while providing interpretable, fine-grained correspondences between text and motion. The code is available in the supplementary material.
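To make the token-wise late interaction concrete, the following is a minimal sketch of a MaxSim-style score between text token embeddings and motion patch embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `maxsim_score`, the use of cosine similarity, and the sum over text tokens are assumptions, and the embeddings are presumed to come from a text encoder and a Vision Transformer applied to the motion pseudo-image.

```python
import numpy as np


def maxsim_score(text_tokens: np.ndarray, motion_patches: np.ndarray):
    """Hypothetical MaxSim late-interaction score (illustrative only).

    text_tokens:    (T, D) array, one embedding per text token.
    motion_patches: (P, D) array, one embedding per motion-image patch.
    Returns the scalar score and, for each text token, the index of its
    best-matching patch.
    """
    # Assumption: L2-normalise so dot products are cosine similarities.
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    p = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)

    sim = t @ p.T                    # (T, P) token-to-patch similarities
    best_patch = sim.argmax(axis=1)  # best patch index for each token
    # Assumption: sum the per-token maxima to obtain the pair score.
    score = float(sim.max(axis=1).sum())
    return score, best_patch


# Toy example: 2 text tokens and 3 motion patches in a 2-D embedding space.
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
score, best_patch = maxsim_score(tokens, patches)
```

In this sketch the `best_patch` indices are what would support interpretability: each text token points to the motion patch it matched most strongly, which a global-embedding dual encoder cannot expose.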
