GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

January 12, 2026
Authors: Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu
cs.AI

Abstract

Discrete motion tokenization has recently enabled Large Language Models (LLMs) to serve as versatile backbones for motion understanding and motion-language reasoning. However, existing pipelines typically decouple motion quantization from semantic embedding learning, linking them solely via token IDs. This approach fails to effectively align the intrinsic geometry of the motion space with the embedding space, thereby hindering the LLM's capacity for nuanced motion reasoning. We argue that alignment is most effective when both modalities share a unified geometric basis. Therefore, instead of forcing the LLM to reconstruct the complex geometry among motion tokens from scratch, we present a novel framework that explicitly enforces orthogonality on both the motion codebook and the LLM embedding space, ensuring that their relational structures naturally mirror each other. Specifically, we employ a decoder-only quantizer with Gumbel-Softmax for differentiable training and balanced codebook usage. To bridge the modalities, we use a sparse projection that maps motion codes into the LLM embedding space while preserving orthogonality. Finally, a two-stage orthonormal regularization schedule enforces soft constraints during tokenizer training and LLM fine-tuning to maintain geometric alignment without hindering semantic adaptation. Extensive experiments on HumanML3D demonstrate that our framework achieves a 20% performance improvement over current state-of-the-art methods, validating that a unified geometric basis effectively empowers the LLM for nuanced motion reasoning.
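
The two ingredients named in the abstract, Gumbel-Softmax code selection and a soft orthonormality constraint on the codebook, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the form of the regularizer, so the squared Frobenius penalty on (C C^T - I) used here is an assumed, commonly used choice, and the function names, the temperature `tau`, and the toy codebooks are all hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample over codebook entries.

    `tau` is an assumed temperature; lower values give sharper samples.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def orthonormal_penalty(codebook):
    """Squared Frobenius norm of (C C^T - I), with codes as the rows of C.

    The penalty is zero exactly when the codes are mutually orthogonal
    and each has unit length.
    """
    penalty = 0.0
    for i, ci in enumerate(codebook):
        for j, cj in enumerate(codebook):
            dot = sum(a * b for a, b in zip(ci, cj))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


if __name__ == "__main__":
    # Toy example: soft assignment over three codes, then two codebooks.
    weights = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=random.Random(0))
    print("soft assignment:", weights)
    print("orthonormal codebook:", orthonormal_penalty([[1.0, 0.0], [0.0, 1.0]]))
    print("collapsed codebook:", orthonormal_penalty([[1.0, 0.0], [1.0, 0.0]]))
```

In a training loop, a penalty of this kind would typically be added to the main loss with a small weight, which is what makes the constraint soft: the codebook is nudged toward orthonormality rather than forced onto it.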