

Multi-human Interactive Talking Dataset

August 5, 2025
Authors: Zeyu Zhu, Weijia Wu, Mike Zheng Shou
cs.AI

Abstract

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, with each clip featuring two to four speakers and fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
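To make the two components concrete, here is a minimal, hypothetical sketch of the MPE/IAD interfaces as the abstract describes them. The module names are paraphrased from the abstract; the dimensions, the mean-pooling aggregation over speakers, and the FiLM-style audio modulation are illustrative assumptions, not the paper's actual design (see the linked repository for that).

```python
# Hypothetical sketch of CovOG's two conditioning components.
# Dimensions, mean pooling, and FiLM-style modulation are assumptions.
import torch
import torch.nn as nn


class MultiHumanPoseEncoder(nn.Module):
    """Encodes a variable number of per-speaker pose vectors and
    aggregates them into one conditioning feature (assumed: mean pooling,
    which is invariant to the number of speakers)."""

    def __init__(self, pose_dim: int = 134, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(pose_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, num_speakers, pose_dim); num_speakers varies per clip.
        per_speaker = self.proj(poses)      # (B, N, D)
        return per_speaker.mean(dim=1)      # aggregate over speakers -> (B, D)


class InteractiveAudioDriver(nn.Module):
    """Modulates each speaker's head-dynamics feature with that speaker's
    audio feature (assumed: FiLM-style scale/shift modulation)."""

    def __init__(self, audio_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.to_scale_shift = nn.Linear(audio_dim, 2 * embed_dim)

    def forward(self, head_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # head_feat:  (B, N, D) per-speaker head features
        # audio_feat: (B, N, audio_dim) speaker-specific audio features
        scale, shift = self.to_scale_shift(audio_feat).chunk(2, dim=-1)
        return head_feat * (1 + scale) + shift


# Toy usage: a clip with three speakers. Because aggregation runs over the
# speaker axis, the same modules accept two- or four-speaker clips unchanged.
mpe = MultiHumanPoseEncoder()
iad = InteractiveAudioDriver()
poses = torch.randn(1, 3, 134)
pose_cond = mpe(poses)                      # (1, 256)
head = torch.randn(1, 3, 256)
audio = torch.randn(1, 3, 768)
modulated = iad(head, audio)                # (1, 3, 256)
```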