DrawMotion: Generating 3D Human Motion by Freehand Drawing

Abstract

Text-to-motion generation translates textual descriptions into human motions, but users often struggle to convey their intended motions precisely through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions from both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) Freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm that automatically generates hand-drawn stickman sketches across different dataset formats; 2) Multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) Training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, so classifier-guidance gradients can update those features and thereby align the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces the time users need to generate motions matching their imagination by approximately 46.7%. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.
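
The idea behind training-free guidance, namely that features in a continuous space can be nudged by a gradient, can be illustrated with a toy example. The following is a minimal sketch and not the DrawMotion implementation: the linear map `P`, the quadratic energy, the step-size rule, and every name in the code are assumptions chosen for illustration. It only shows that a gradient step on a continuous feature vector lowers a differentiable mismatch measure.

```python
import numpy as np


def energy(z, P, target):
    """Hypothetical mismatch E(z) = ||P z - target||^2 between features and a sketch target."""
    return float(np.sum((P @ z - target) ** 2))


def guidance_step(z, P, target, scale):
    """One gradient step on the continuous feature z that lowers E(z)."""
    residual = P @ z - target
    grad = 2.0 * P.T @ residual  # analytic gradient of the quadratic energy
    return z - scale * grad


rng = np.random.default_rng(0)
z = rng.normal(size=8)        # stand-in for an intermediate feature vector (assumption)
P = rng.normal(size=(4, 8))   # stand-in map from features to 2D sketch coordinates (assumption)
target = rng.normal(size=4)   # stand-in for coordinates taken from a user sketch (assumption)

# Step size 1/L, where L = 2 * sigma_max(P)^2 is the Lipschitz constant of the gradient;
# this choice guarantees that the energy does not increase.
scale = 1.0 / (2.0 * np.linalg.norm(P, 2) ** 2)

before = energy(z, P, target)
z_guided = guidance_step(z, P, target, scale)
after = energy(z_guided, P, target)
print(f"energy before: {before:.4f}, after: {after:.4f}")
```

In a diffusion sampler, a step of this kind would typically be interleaved with the denoising updates, with the gradient coming from a differentiable measure of how well the current motion matches the drawing. The toy quadratic energy here merely stands in for that measure.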