ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Abstract

Multimodal immersive spatial drama generation focuses on creating continuous multi-speaker binaural speech with dramatic prosody from multimodal prompts, with potential applications in AR, VR, and other domains. The task requires modeling spatial information and dramatic prosody simultaneously from multimodal inputs, and collecting data for it is costly. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model driven by multimodal prompting. ISDrama comprises two primary components: 1) a Multimodal Pose Encoder, based on contrastive learning, which accounts for the Doppler effect caused by moving speakers in order to extract unified pose information from multimodal prompts; 2) an Immersive Drama Transformer, a flow-based Mamba-Transformer model that generates high-quality drama and incorporates Drama-MOE to select suitable experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to generate a complete drama coherently. Experimental results show that ISDrama outperforms baseline models on both objective and subjective metrics. The demos and dataset are available at https://aaronz345.github.io/ISDramaDemo.
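To make the Doppler effect mentioned in the abstract concrete, the following is a minimal sketch of the standard physical formula for a moving source and a stationary listener. It is not the Multimodal Pose Encoder or any part of ISDrama. The function names, the listener-at-origin convention, and the 343 m/s speed of sound are assumptions chosen for illustration.

```python
import math

# Assumption: approximate speed of sound in air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_S = 343.0


def radial_velocity(position, velocity):
    """Speed (m/s) at which a source approaches a listener at the origin.

    Positive values mean the source is approaching, negative values
    mean it is receding. Both arguments are (x, y, z) tuples.
    """
    distance = math.sqrt(sum(p * p for p in position))
    if distance == 0.0:
        raise ValueError("source position coincides with the listener")
    # Project the velocity onto the unit vector from source to listener.
    return -sum(p * v for p, v in zip(position, velocity)) / distance


def doppler_frequency(source_hz, position, velocity, c=SPEED_OF_SOUND_M_S):
    """Observed frequency for a moving source and a stationary listener."""
    v_r = radial_velocity(position, velocity)
    if v_r >= c:
        raise ValueError("approach speed must stay below the speed of sound")
    return source_hz * c / (c - v_r)
```

For a speaker 10 m away walking toward the listener at 5 m/s, `doppler_frequency(200.0, (10.0, 0.0, 0.0), (-5.0, 0.0, 0.0))` gives a value slightly above 200 Hz. Purely tangential motion leaves the frequency unchanged, which is why the radial component of the pose trajectory is the relevant quantity.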
