Encoder-Free Human Motion Understanding via Structured Motion Descriptions
April 23, 2026
Authors: Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao
cs.AI
Abstract
The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and thus remain constrained by cross-modal representation learning and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural-language descriptions of joint angles, body-part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. This approach achieves state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
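To make the idea of a rule-based, deterministic motion-to-text conversion concrete, here is a minimal sketch of turning one joint angle into a natural-language fragment. The angle thresholds, the wording, and the `describe_elbow` helper are illustrative assumptions, not the paper's actual rules or vocabulary:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    u = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values just outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def describe_elbow(shoulder, elbow, wrist):
    """Map the elbow angle to a coarse textual description (hypothetical bins)."""
    ang = joint_angle(shoulder, elbow, wrist)
    if ang > 150:
        state = "fully extended"
    elif ang > 90:
        state = "slightly bent"
    else:
        state = "sharply bent"
    return f"the right elbow is {state} (~{ang:.0f} deg)"

# A straight arm: shoulder, elbow, wrist are collinear -> 180 degrees.
print(describe_elbow([0, 0, 0], [0.3, 0, 0], [0.6, 0, 0]))
# → the right elbow is fully extended (~180 deg)
```

Applying such deterministic rules per joint and per frame window yields a purely textual motion representation that any off-the-shelf LLM can consume without a learned motion encoder.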