GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

January 12, 2026
Authors: Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu
cs.AI

Abstract

Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
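
As a rough illustration of two ingredients named in the abstract, the sketch below shows a Gumbel-Softmax relaxation over codebook entries and a soft orthonormality penalty on a codebook. It is not the authors' implementation: the abstract does not give the exact loss, so the penalty is assumed here to be the squared Frobenius norm of C Cᵀ − I, and the function names `orthonormal_penalty`, `gumbel_softmax`, and `quantize` are hypothetical. The sparse projection into the LLM embedding space is not covered.

```python
import math
import random


def orthonormal_penalty(codebook):
    """Assumed soft constraint: squared Frobenius norm of (C C^T - I).

    `codebook` is a list of code vectors (rows of C). The value is 0
    exactly when the rows are orthonormal.
    """
    k = len(codebook)
    total = 0.0
    for i in range(k):
        for j in range(k):
            gram_ij = sum(a * b for a, b in zip(codebook[i], codebook[j]))
            target = 1.0 if i == j else 0.0
            total += (gram_ij - target) ** 2
    return total


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample over codebook entries.

    Adds Gumbel(0, 1) noise g = -log(-log(u)) to each logit, then applies
    a temperature-scaled softmax. Lower `tau` gives sharper samples.
    """
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]


def quantize(weights, codebook):
    """Soft-quantized vector: the weighted sum of code vectors."""
    dim = len(codebook[0])
    return [
        sum(w * code[d] for w, code in zip(weights, codebook))
        for d in range(dim)
    ]
```

In a training loop of this kind, the penalty would typically be added to the main objective with a small weight, which is one way to read the "soft constraints" the abstract mentions. The weighting and schedule used in the paper are not stated in the abstract.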