
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

April 29, 2025
Authors: Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
cs.AI

Abstract

Multimodal immersive spatial drama generation aims to create continuous multi-speaker binaural speech with dramatic prosody from multimodal prompts, with potential applications in AR, VR, and related areas. The task requires modeling spatial information and dramatic prosody simultaneously from multimodal inputs, and collecting suitable data is costly. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model driven by multimodal prompting. ISDrama comprises these primary components: 1) a Multimodal Pose Encoder, based on contrastive learning, which accounts for the Doppler effect caused by moving speakers and extracts unified pose information from multimodal prompts; 2) an Immersive Drama Transformer, a flow-based Mamba-Transformer model that generates high-quality drama and incorporates Drama-MOE to select appropriate experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to generate complete dramas coherently. Experimental results show that ISDrama outperforms baseline models on both objective and subjective metrics. The demos and dataset are available at https://aaronz345.github.io/ISDramaDemo.
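
The abstract notes that the pose encoder considers the Doppler effect caused by moving speakers. As background only, the following is a minimal sketch of the textbook Doppler relation for a moving source and a stationary listener. It is not the ISDrama encoder, which the abstract does not specify; the function names, the 343 m/s speed of sound, and the finite-difference velocity estimate are all assumptions made for illustration.

```python
import math

# Assumption: approximate speed of sound in air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_S = 343.0


def radial_velocity(prev_pos, curr_pos, listener_pos, dt):
    """Approach speed (m/s) estimated from two consecutive 3D positions.

    Hypothetical helper: a positive value means the speaker is moving
    toward the listener, a negative value means moving away.
    """
    d_prev = math.dist(prev_pos, listener_pos)
    d_curr = math.dist(curr_pos, listener_pos)
    return (d_prev - d_curr) / dt


def doppler_shifted_frequency(source_hz, approach_speed_m_s,
                              speed_of_sound=SPEED_OF_SOUND_M_S):
    """Frequency heard by a stationary listener for a moving source.

    Textbook relation: f_obs = f_src * c / (c - v), with v > 0 when the
    source approaches the listener.
    """
    if abs(approach_speed_m_s) >= speed_of_sound:
        raise ValueError("source speed must be below the speed of sound")
    return source_hz * speed_of_sound / (speed_of_sound - approach_speed_m_s)


if __name__ == "__main__":
    # A speaker walks 1 m closer to a listener at the origin in 1 s.
    v = radial_velocity((2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
    print(v, doppler_shifted_frequency(440.0, v))
```

At walking speeds the shift is small (a fraction of a percent of the source frequency), which suggests why a model would need explicit pose trajectories to capture it rather than a single static position.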

