Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and thus remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses prior state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
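
To make the idea of a rule-based, deterministic conversion from joint positions to text concrete, the following is a minimal sketch and not the authors' implementation. It computes a single joint angle from three 3D joint positions per frame and prints one line of text per frame. The function names (`joint_angle`, `describe_knee`), the dictionary keys, the 20 fps frame rate, and the output wording are all assumptions made for illustration; the actual SMD format also covers body part movements and global trajectory and may look quite different.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at point b formed by the segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # Degenerate case (coincident joints): report 0 rather than dividing by zero.
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def describe_knee(frames, fps=20):
    """Turn a sequence of frames into a text description of the knee angle.

    `frames` is a list of dicts with hypothetical keys 'hip', 'knee', 'ankle',
    each mapping to an (x, y, z) position. `fps` is an assumed frame rate.
    """
    lines = []
    for t, frame in enumerate(frames):
        angle = joint_angle(frame["hip"], frame["knee"], frame["ankle"])
        lines.append(f"t={t / fps:.2f}s: right knee angle {angle:.0f} deg")
    return "\n".join(lines)


if __name__ == "__main__":
    frames = [
        # Straight leg: hip, knee, and ankle are collinear.
        {"hip": (0.0, 1.0, 0.0), "knee": (0.0, 0.5, 0.0), "ankle": (0.0, 0.0, 0.0)},
        # Bent leg: the shank is perpendicular to the thigh.
        {"hip": (0.0, 1.0, 0.0), "knee": (0.0, 0.5, 0.0), "ankle": (0.5, 0.5, 0.0)},
    ]
    print(describe_knee(frames))
```

Because the mapping is a fixed function of the joint positions, the same motion always produces the same text, which is what would allow one text input to be reused across different LLMs without a learned encoder.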
