Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Abstract

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, and therefore remain constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses the prior state of the art on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
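To make the idea of a rule-based, deterministic motion-to-text conversion concrete, the following is a minimal Python sketch of one such rule: it computes an elbow angle from three joint positions and emits a short description. It is an illustration only and not the SMD implementation. The function names, the 10-degree threshold, the verb choices, and the output format are all assumptions made for this example; the actual rules and text format are defined in the released code.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at joint b formed by the 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0.0:
        raise ValueError("degenerate joint: coincident points")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def describe_elbow(frames, side="left", threshold_deg=10.0):
    """Describe elbow motion over a sequence of frames.

    frames: list of (shoulder, elbow, wrist) 3D positions, one tuple per frame.
    threshold_deg: hypothetical cutoff below which the angle counts as unchanged.
    """
    angles = [joint_angle(s, e, w) for s, e, w in frames]
    start, end = angles[0], angles[-1]
    delta = end - start
    if abs(delta) < threshold_deg:
        verb = "stays roughly constant"
    elif delta < 0:
        verb = "flexes"  # angle decreases: the arm bends
    else:
        verb = "extends"  # angle increases: the arm straightens
    return f"{side} elbow {verb}: {start:.0f} deg -> {end:.0f} deg"


# Two frames: a straight arm, then the forearm raised to a right angle.
frames = [
    ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
]
print(describe_elbow(frames))
```

Because the mapping is a fixed function of the joint positions, the same input sequence always yields the same text, which is the property that lets the description be passed unchanged to different LLMs.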
