ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Abstract

Multimodal immersive spatial drama generation aims to create continuous multi-speaker binaural speech with dramatic prosody from multimodal prompts, with potential applications in areas such as AR and VR. The task requires modeling spatial information and dramatic prosody simultaneously from multimodal inputs, and collecting data for it is costly. To the best of our knowledge, our work is the first attempt to address these challenges. We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first model to generate immersive spatial drama through multimodal prompting. ISDrama comprises two primary components: 1) the Multimodal Pose Encoder, which is based on contrastive learning and accounts for the Doppler effect caused by moving speakers when extracting unified pose information from multimodal prompts; and 2) the Immersive Drama Transformer, a flow-based Mamba-Transformer model that generates high-quality drama and incorporates Drama-MOE to select suitable experts for enhanced prosody and pose control. We also design a context-consistent classifier-free guidance strategy to generate complete drama coherently. Experimental results show that ISDrama outperforms baseline models on both objective and subjective metrics. The demos and dataset are available at https://aaronz345.github.io/ISDramaDemo.
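
To make the Doppler effect mentioned above concrete, the following is a minimal sketch of the textbook relation for a moving source and a stationary listener. It is not ISDrama's pose encoder or any part of the authors' implementation; the function names, the fixed speed of sound, and the choice of placing the listener at the origin are all assumptions made for illustration.

```python
import math

# Assumed constant: speed of sound in air at roughly 20 degrees Celsius (m/s).
SPEED_OF_SOUND = 343.0


def radial_velocity(p0, p1, dt, listener=(0.0, 0.0, 0.0)):
    """Rate of change of the source-listener distance in m/s.

    p0 and p1 are source positions (metres) at two times dt seconds apart.
    A positive value means the source is moving away from the listener.
    """
    d0 = math.dist(p0, listener)
    d1 = math.dist(p1, listener)
    return (d1 - d0) / dt


def doppler_factor(v_radial, c=SPEED_OF_SOUND):
    """Ratio of observed to emitted frequency for a moving source.

    Uses the classical relation f_observed = f_emitted * c / (c + v_radial),
    valid for a stationary listener and a source approaching slower than c.
    """
    if c + v_radial <= 0:
        raise ValueError("source must approach slower than the speed of sound")
    return c / (c + v_radial)


if __name__ == "__main__":
    # Hypothetical speaker walking from 5 m to 4 m away over one second.
    v = radial_velocity((5.0, 0.0, 0.0), (4.0, 0.0, 0.0), dt=1.0)
    print(f"radial velocity: {v} m/s, frequency ratio: {doppler_factor(v):.5f}")
```

A speaker approaching the listener yields a ratio slightly above 1 (higher perceived pitch), and a receding speaker yields a ratio below 1, which is why pose information that includes motion, not only position, matters for spatial speech.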