Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
August 21, 2024
作者: Junwon Lee, Jaekwon Im, Dabin Kim, Juhan Nam
cs.AI
Abstract
Foley sound synthesis is crucial for multimedia production, enhancing user
experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through
video-to-sound generation face significant challenges. Systems lacking explicit
temporal features suffer from poor controllability and alignment, while
timestamp-based models require costly and subjective human annotation. We
propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a
temporal event condition with semantic timbre prompts (audio or text). RMS, a
frame-level intensity envelope feature closely related to audio semantics,
ensures high controllability and synchronization. The annotation-free
self-supervised learning framework consists of two stages, Video2RMS and
RMS2Sound, incorporating novel ideas including RMS discretization and
RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation
shows that Video-Foley achieves state-of-the-art performance in audio-visual
alignment and controllability for sound timing, intensity, timbre, and nuance.
Code, model weights, and demonstrations are available on the accompanying
website. (https://jnwnlee.github.io/video-foley-demo)
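The central conditioning signal described above is a frame-level RMS intensity envelope, which the system discretizes before predicting it from video (Video2RMS) and using it to guide audio generation (RMS2Sound). As a minimal sketch only, the Python snippet below computes such an envelope with librosa and quantizes it into uniform bins; the function name, frame/hop sizes, bin count, and min-max quantization scheme are illustrative assumptions, not the paper's published configuration.

```python
import numpy as np
import librosa


def rms_envelope(audio_path: str, frame_length: int = 2048,
                 hop_length: int = 512, n_bins: int = 64):
    """Frame-level RMS envelope plus a naive uniform discretization.

    All hyperparameters here are illustrative assumptions, not the
    settings used by Video-Foley.
    """
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    # RMS per analysis frame: sqrt(mean(x^2)) over each window.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    # Min-max normalize to [0, 1], then quantize into n_bins levels so
    # the continuous envelope becomes a discrete per-frame sequence.
    rms_norm = rms / (rms.max() + 1e-8)
    rms_discrete = np.minimum((rms_norm * n_bins).astype(int), n_bins - 1)
    return rms, rms_discrete
```

Once discretized, predicting the envelope from video frames becomes a per-frame classification problem rather than a regression, which is one plausible motivation for the RMS discretization the abstract mentions.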