Multi-human Interactive Talking Dataset

Abstract

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To build it, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE), which handles varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD), which modulates head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
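The abstract does not say how MPE aggregates individual pose embeddings, so the following is only an illustrative sketch of one common way to turn a varying number of per-person vectors into a fixed-size representation: mean pooling. The function name `aggregate_pose_embeddings` and the toy two-dimensional embeddings are assumptions made for this example and are not taken from the CovOG code.

```python
from typing import List, Sequence


def aggregate_pose_embeddings(per_person: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool any number of same-length per-person embeddings into one vector.

    Hypothetical helper for illustration only; not the authors' MPE.
    """
    if not per_person:
        raise ValueError("need at least one person")
    dim = len(per_person[0])
    if any(len(e) != dim for e in per_person):
        raise ValueError("all embeddings must share one dimension")
    n = len(per_person)
    # Average each coordinate across people, so the output length is `dim`
    # no matter how many speakers are present.
    return [sum(e[i] for e in per_person) / n for i in range(dim)]


# Two speakers and three speakers both yield a vector of the same length.
two = aggregate_pose_embeddings([[1.0, 2.0], [3.0, 4.0]])
three = aggregate_pose_embeddings([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Mean pooling is also independent of the order in which speakers are listed, which is convenient when a scene has no canonical speaker ordering. A learned aggregator (for example, attention over people) would be an equally plausible choice.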
