Towards Streaming Synchronized Spatial Audio Generation with Autoregressive Diffusion Transformers

Abstract

Real-time, accurate spatial audio generation is pivotal for delivering immersive experiences. However, existing spatial audio synthesis methods are often hampered by a trade-off between generation quality and inference latency, and by the difficulty of capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere makes three main contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming, high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy that aligns the video encoder with the acoustic domain, and we further employ a multi-objective online direct preference optimization (ODPO) scheme; together these yield strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we develop an automated annotation pipeline that generates detailed spatial captions. Experimental results show that SwanSphere achieves superior performance on both video-to-spatial-audio and text-to-spatial-audio generation tasks. Demos are available at https://swanaigc.github.io.
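To make the idea of contrastive video-audio alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss over paired embeddings. It is not the SVAC objective itself, whose exact form the abstract does not specify. The function name `symmetric_info_nce`, the temperature value, and the embedding shapes are all assumptions chosen for illustration.

```python
import numpy as np


def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative; not the paper's SVAC loss).

    video_emb, audio_emb: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same clip (a positive pair).
    """
    # L2-normalise so that the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch); diagonal = positives.
    logits = (v @ a.T) / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The target class for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs :", symmetric_info_nce(emb, emb))
    print("shuffled pairs:", symmetric_info_nce(emb, np.roll(emb, 1, axis=0)))
```

Under this sketch, correctly paired embeddings give a lower loss than mismatched ones, which is the signal a contrastive alignment stage would use to pull a video encoder's outputs towards the matching audio representation.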