DisCo: Disentangled Control for Referring Human Dance Generation in Real World

Abstract

Generative AI has made significant strides in computer vision, particularly in image and video synthesis conditioned on text descriptions. Despite these advances, generating human-centric content such as dance remains challenging. Existing dance synthesis methods struggle to close the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting, Referring Human Dance Generation, which focuses on real-world dance scenarios and has three important properties: (i) Faithfulness: the synthesis should retain the appearance of both the human subject foreground and the background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: the model should allow composition of seen or unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce DISCO, an approach that combines a novel model architecture with disentangled control, which improves the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training, which improves generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code, demo, video, and visualizations are available at https://disco-dance.github.io/.
