Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
March 10, 2026
Authors: Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
cs.AI
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods adopt a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences and thereby reducing retrieval accuracy. These global-embedding methods also offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late-interaction mechanism, and strengthen it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while providing interpretable, fine-grained correspondences between text and motion. The code is available in the supplementary material.
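To make the pseudo-image idea concrete, the following is a minimal NumPy sketch. The abstract does not specify the exact joint-angle encoding or image layout, so the time-by-joint grid, the use of three angle components as channels, the nearest-neighbor resampling, and the helper name `joint_angle_pseudo_image` are all illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def joint_angle_pseudo_image(angles: np.ndarray, size: int = 224) -> np.ndarray:
    """Pack a joint-angle sequence into a 3-channel pseudo-image.

    angles: (T, J, 3) array of per-frame, per-joint rotation angles
            (e.g. Euler angles in radians). Layout is an assumption:
            time along one image axis, joints along the other, the three
            angle components as RGB-like channels.
    Returns a (size, size, 3) float image rescaled to [0, 1], so a
    standard ViT preprocessing pipeline can consume it.
    """
    lo, hi = angles.min(), angles.max()
    img = (angles - lo) / (hi - lo + 1e-8)      # normalize to [0, 1]
    # Nearest-neighbor resample the (T, J) grid to (size, size).
    t_idx = np.linspace(0, img.shape[0] - 1, size).round().astype(int)
    j_idx = np.linspace(0, img.shape[1] - 1, size).round().astype(int)
    return img[t_idx][:, j_idx]                 # (size, size, 3)

# Toy usage: 120 frames, 22 joints (the HumanML3D skeleton), 3 channels.
motion = np.random.randn(120, 22, 3).astype(np.float32)
print(joint_angle_pseudo_image(motion).shape)   # (224, 224, 3)
```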
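MaxSim itself follows ColBERT-style late-interaction scoring: each text token embedding is matched against its most similar motion-patch embedding, and the per-token maxima are summed into a relevance score. A minimal PyTorch sketch with toy dimensions is below; `maxsim_score` is a hypothetical helper name, and the embedding sizes are placeholders.

```python
import torch

def maxsim_score(text_tokens: torch.Tensor, motion_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late interaction.

    text_tokens:    (Nt, D) L2-normalized text token embeddings.
    motion_patches: (Np, D) L2-normalized motion pseudo-image patch embeddings.
    Returns a scalar relevance score: each text token is matched to its
    most similar motion patch, and the per-token maxima are summed.
    """
    sim = text_tokens @ motion_patches.T        # (Nt, Np) cosine similarities
    per_token_max, _ = sim.max(dim=1)           # best-matching patch per token
    return per_token_max.sum()

# Toy usage: 6 text tokens vs. 49 motion patches in a 256-d shared space.
t = torch.nn.functional.normalize(torch.randn(6, 256), dim=-1)
m = torch.nn.functional.normalize(torch.randn(49, 256), dim=-1)
print(maxsim_score(t, m))
```

Because the score decomposes into per-token maxima, each text token's best-matching patch can be inspected directly, which is what enables the fine-grained, interpretable text-motion correspondences the abstract describes.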