Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Abstract

Prior work on motion generation largely follows two paradigms: continuous diffusion models, which excel at kinematic control, and discrete token-based generators, which are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during Planning, while fine-grained constraints are enforced during Control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods, whose fidelity degrades under stronger kinematic constraints, our method improves fidelity as constraints are strengthened, reducing FID from 0.033 to 0.014.
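To put the reported numbers in relative terms, the short sketch below computes the fractional reduction for each before/after pair quoted in the abstract. The helper name `relative_reduction` and the dictionary keys are illustrative choices, not part of the paper; only the six numeric values come from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# (before, after) pairs as quoted in the abstract for HumanML3D.
# The key names are hypothetical labels chosen for this example.
reported = {
    "trajectory_error_cm": (0.72, 0.08),         # MaskControl vs. this method
    "fid_vs_maskcontrol": (0.083, 0.029),        # MaskControl vs. this method
    "fid_stronger_constraints": (0.033, 0.014),  # this method, stronger constraints
}

reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in reported.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower")
```

Run as written, this reports reductions of about 88.9% in trajectory error, 65.1% in FID relative to MaskControl, and 57.6% in FID under stronger kinematic constraints.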
