DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

Abstract

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaged in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio, together with optional social-context tokens, as input and produces context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally use the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, showing that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, which highlights its robustness and its ability to generate socially favorable motion. Code and models will be released upon acceptance.
