

Multi-human Interactive Talking Dataset

August 5, 2025
Authors: Zeyu Zhu, Weijia Wu, Mike Zheng Shou
cs.AI

Abstract

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each clip featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
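The abstract describes the two CovOG components only at a high level. As a minimal sketch of how such components might look (the tensor shapes, dimensions, class internals, and the choices of sum aggregation and scale-and-shift modulation are assumptions for illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn


class MultiHumanPoseEncoder(nn.Module):
    """Sketch of the MPE idea: embed each speaker's pose sequence
    independently, then aggregate across speakers so the encoder
    accepts a variable number of speakers."""

    def __init__(self, pose_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(pose_dim, embed_dim)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (num_speakers, seq_len, pose_dim); num_speakers may vary
        per_speaker = self.proj(poses)      # (N, T, embed_dim)
        return per_speaker.sum(dim=0)       # aggregate over speakers -> (T, embed_dim)


class InteractiveAudioDriver(nn.Module):
    """Sketch of the IAD idea: derive a modulation signal from
    speaker-specific audio features and apply it to head-motion
    features (here a simple scale-and-shift)."""

    def __init__(self, audio_dim: int, embed_dim: int):
        super().__init__()
        self.to_mod = nn.Linear(audio_dim, 2 * embed_dim)

    def forward(self, head_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_mod(audio_feat).chunk(2, dim=-1)
        return head_feat * (1 + scale) + shift


# Hypothetical usage: three speakers, 16 frames, toy feature dimensions.
mpe = MultiHumanPoseEncoder(pose_dim=134, embed_dim=256)
iad = InteractiveAudioDriver(audio_dim=128, embed_dim=256)
pose_emb = mpe(torch.randn(3, 16, 134))          # (16, 256)
head_feat = iad(pose_emb, torch.randn(16, 128))  # (16, 256)
```

Sum aggregation keeps the encoder permutation-invariant and independent of the speaker count; attention pooling over per-speaker embeddings would be a natural alternative under the same interface.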