Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Abstract

Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies that try to automate this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor controllability and alignment, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system that uses Root Mean Square (RMS) as a temporal event condition together with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope feature closely related to audio semantics, ensures high controllability and synchronization. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, and incorporates novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Code, model weights, and demonstrations are available on the accompanying website: https://jnwnlee.github.io/video-foley-demo
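To make the idea of a frame-level RMS envelope concrete, the following is a minimal sketch of how such an envelope could be computed from raw samples and then mapped to discrete levels. The function names, the frame and hop sizes, the number of bins, and the uniform binning scheme are all illustrative assumptions; the abstract does not specify how Video-Foley computes or discretizes RMS, so this should not be read as the authors' method.

```python
import math


def frame_rms(samples, frame_length=1024, hop_length=512):
    """Return one RMS value per frame of a mono signal.

    frame_length and hop_length are hypothetical defaults, not values
    taken from the paper. A signal shorter than one frame yields [].
    """
    envelope = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        mean_square = sum(x * x for x in frame) / frame_length
        envelope.append(math.sqrt(mean_square))
    return envelope


def discretize_rms(envelope, n_bins=64, max_value=1.0):
    """Map each RMS value to an integer bin in [0, n_bins - 1].

    Uniform binning is an assumption made for illustration; the paper's
    actual discretization scheme may differ.
    """
    bins = []
    for value in envelope:
        clipped = min(max(value, 0.0), max_value)
        index = int(clipped / max_value * n_bins)
        bins.append(min(index, n_bins - 1))  # keep max_value in the top bin
    return bins


# Usage: one frame of silence followed by one frame of constant amplitude.
signal = [0.0] * 1024 + [0.5] * 1024
envelope = frame_rms(signal)
levels = discretize_rms(envelope)
```

In this toy signal the envelope rises from silence to a steady level, which is the kind of timing and intensity cue that a temporal event condition is meant to carry. Treating the envelope as a sequence of discrete levels turns its prediction into a per-frame classification problem, which is one plausible motivation for discretizing it.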
