SoundReactor: Frame-level Online Video-to-Audio Generation

October 2, 2025
Authors: Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
cs.AI

Abstract

Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through diffusion pre-training followed by consistency fine-tuning to accelerate the decoding of the diffusion head. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
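
To make the online constraint concrete, below is a minimal PyTorch sketch of the per-frame generation loop the abstract describes. It is a sketch under stated assumptions, not the authors' implementation: the dimensions, the mean-pooling aggregator, the backbone depth, and the few-step head are all illustrative, and only the call pattern (one vision token per frame, causal attention over past steps, a 1-4 NFE head) follows the abstract.

```python
# Minimal sketch of a frame-level online V2A loop in the spirit of
# SoundReactor. Every dimension, module, and hyperparameter below is an
# illustrative assumption, not the paper's implementation.
import torch
import torch.nn as nn

D_VISION = 384   # assumed: feature dim of the smallest DINOv2 variant (ViT-S)
D_MODEL = 512    # assumed transformer width
D_LATENT = 64    # assumed per-frame continuous audio-latent dim

class FrameTokenAggregator(nn.Module):
    """Pools DINOv2 grid (patch) features into a single token per frame.
    Mean pooling is an assumption; the abstract only says the patch
    features are aggregated into one token."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_VISION, D_MODEL)

    def forward(self, patch_feats):            # (B, N_patches, D_VISION)
        return self.proj(patch_feats.mean(1))  # (B, D_MODEL)

class CausalBackbone(nn.Module):
    """Decoder-only causal transformer over per-frame (vision, audio) pairs."""
    def __init__(self, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_MODEL, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.audio_in = nn.Linear(D_LATENT, D_MODEL)

    def forward(self, vis_tokens, audio_latents):
        # One fused token per frame; the causal mask keeps every step
        # blind to future video frames (the online constraint).
        x = vis_tokens + self.audio_in(audio_latents)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.encoder(x, mask=mask)

class FewStepHead(nn.Module):
    """Stand-in for the diffusion head. After consistency fine-tuning the
    real head decodes a latent in 1-4 NFEs; this toy refinement loop only
    mimics that few-step call pattern, not an actual sampler."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MODEL + D_LATENT, D_MODEL), nn.SiLU(),
            nn.Linear(D_MODEL, D_LATENT),
        )

    def sample(self, h, nfe=1):                # h: (B, D_MODEL)
        z = torch.randn(h.size(0), D_LATENT)   # start from noise
        for _ in range(nfe):                   # few function evaluations
            z = self.net(torch.cat([h, z], -1))
        return z

aggregate, backbone, head = FrameTokenAggregator(), CausalBackbone(), FewStepHead()
for m in (aggregate, backbone, head):
    m.eval()

with torch.no_grad():
    audio_ctx = [torch.zeros(1, D_LATENT)]       # start-of-sequence latent
    vis_ctx = torch.empty(1, 0, D_MODEL)
    for t in range(30):                          # e.g. one second of 30FPS video
        patches = torch.randn(1, 256, D_VISION)  # stand-in for a DINOv2 patch grid
        vis_ctx = torch.cat([vis_ctx, aggregate(patches)[:, None]], dim=1)
        lat = torch.stack(audio_ctx, dim=1)      # latents for frames < t (shifted)
        h = backbone(vis_ctx, lat)[:, -1]        # hidden state for current frame
        audio_ctx.append(head.sample(h, nfe=1))  # next audio latent, no lookahead
```

The property the sketch illustrates is the task definition itself: the audio latent emitted for frame t depends only on video frames up to t and previously generated latents, so no future frames are ever required.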