

SoundReactor: Frame-level Online Video-to-Audio Generation

October 2, 2025
Authors: Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji
cs.AI

Abstract

Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. The model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained with diffusion pre-training followed by consistency fine-tuning to accelerate diffusion-head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Moreover, it achieves low per-frame waveform-level latency (26.3 ms with head NFE=1, 31.5 ms with NFE=4) on 30 FPS, 480p videos using a single H100 GPU. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
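To make the data flow concrete, below is a minimal PyTorch sketch of the per-frame online loop the abstract describes: one aggregated DINOv2-style vision token per frame, a decoder-only causal transformer over continuous audio latents, and a few-step diffusion head. Everything here — module names, dimensions, the stub patch encoder standing in for DINOv2 ViT-S, the token/latent fusion scheme, and the sampler — is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

D = 256          # transformer width (assumed)
LATENT_DIM = 64  # size of one frame's continuous audio latent (assumed)

class VisionTokenizer(nn.Module):
    """Stand-in for DINOv2 ViT-S/14 grid (patch) features: all patch
    features of one frame are aggregated into a single conditioning token."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, D, kernel_size=14, stride=14)  # 14x14 patch grid
        self.proj = nn.Linear(D, D)

    def forward(self, frame):                        # frame: (B, 3, H, W)
        grid = self.patchify(frame)                  # (B, D, H/14, W/14)
        token = grid.flatten(2).mean(-1)             # mean-pool the grid -> (B, D)
        return self.proj(token)                      # one token per frame

class CausalBackbone(nn.Module):
    """Decoder-only transformer over per-frame [video token + audio latent] steps."""
    def __init__(self, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, n_heads, dim_feedforward=4 * D,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.audio_in = nn.Linear(LATENT_DIM, D)

    def forward(self, vid_tokens, audio_latents):
        # vid_tokens: (B, T, D); audio_latents: (B, T, LATENT_DIM)
        x = vid_tokens + self.audio_in(audio_latents)        # fusion scheme is assumed
        T = x.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.blocks(x, mask=causal)                   # (B, T, D)

class DiffusionHead(nn.Module):
    """Tiny MLP head decoding the next audio latent from the backbone context;
    a placeholder for the paper's consistency-fine-tuned diffusion head."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D + LATENT_DIM + 1, 4 * D), nn.SiLU(),
                                 nn.Linear(4 * D, LATENT_DIM))

    @torch.no_grad()
    def sample(self, ctx, nfe=1):                    # ctx: (B, D); nfe of 1-4 per the paper
        z = torch.randn(ctx.shape[0], LATENT_DIM)
        for i in range(nfe):                         # few-step denoising (illustrative)
            t = torch.full((ctx.shape[0], 1), 1.0 - i / nfe)
            z = self.net(torch.cat([ctx, z, t], dim=-1))
        return z

# Online generation: one audio latent per arriving video frame, no lookahead.
vision, backbone, head = VisionTokenizer(), CausalBackbone(), DiffusionHead()
vid_tokens, audio = [], [torch.zeros(1, LATENT_DIM)]         # assumed BOS audio latent
for _ in range(8):                                           # stand-in for a 30 FPS stream
    frame = torch.randn(1, 3, 224, 224)                      # placeholder decoded frame
    vid_tokens.append(vision(frame))
    ctx = backbone(torch.stack(vid_tokens, 1), torch.stack(audio, 1))[:, -1]
    audio.append(head.sample(ctx, nfe=1))                    # latent -> waveform via a codec
```

A real streaming system would presumably cache the transformer's keys and values so per-frame cost stays constant, rather than re-running the full prefix as this sketch does, and would decode each latent to a stereo waveform with a neural audio codec; together with the consistency-tuned head's low NFE, such choices are what the reported 26.3-31.5 ms per-frame latencies would rest on.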