Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting a compact video cross-attention layer after the model's existing text cross-attention, so that prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text), and the bridge learns the audio-video dependency needed for synchronization, without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (encoders or the T2A backbone can be swapped or upgraded without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
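To make the bridge idea concrete, the following is a minimal NumPy sketch of one block in which a video cross-attention step follows a text cross-attention step, with video tokens pooled beforehand. It is an illustration under stated assumptions, not the implementation described above: the single-head attention, the average pooling, the zero-initialized output projection, all dimensions, and every function name (`cross_attention`, `pool_video_tokens`, `init_attention`, `bridged_block`) are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v, w_o):
    """Single-head cross-attention: query tokens attend over context tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ w_o

def pool_video_tokens(video_tokens, factor):
    """Average-pool tokens along the token axis (assumed pooling scheme)."""
    n, d = video_tokens.shape
    n_keep = (n // factor) * factor  # drop a ragged tail, if any
    return video_tokens[:n_keep].reshape(n // factor, factor, d).mean(axis=1)

def init_attention(d_query, d_context, d_head, zero_out=False):
    """Random projections; zero_out=True zero-initializes the output projection."""
    w_q = rng.normal(0, 0.02, (d_query, d_head))
    w_k = rng.normal(0, 0.02, (d_context, d_head))
    w_v = rng.normal(0, 0.02, (d_context, d_head))
    if zero_out:
        w_o = np.zeros((d_head, d_query))
    else:
        w_o = rng.normal(0, 0.02, (d_head, d_query))
    return w_q, w_k, w_v, w_o

def bridged_block(audio, text, video, text_attn, video_attn, pool_factor=4):
    # Text cross-attention: stands in for the frozen backbone's existing layer.
    h = audio + cross_attention(audio, text, *text_attn)
    # Video cross-attention: the only part that would be trained in this sketch.
    pooled = pool_video_tokens(video, pool_factor)
    return h + cross_attention(h, pooled, *video_attn)

# Hypothetical sizes: 32 audio latents, 8 text tokens, 40 video tokens.
audio = rng.normal(size=(32, 64))
text = rng.normal(size=(8, 48))
video = rng.normal(size=(40, 96))

text_attn = init_attention(64, 48, 32)                  # treated as frozen
video_attn = init_attention(64, 96, 32, zero_out=True)  # trainable bridge

out = bridged_block(audio, text, video, text_attn, video_attn)
frozen_only = audio + cross_attention(audio, text, *text_attn)
bridge_params = sum(w.size for w in video_attn)
```

With the zero-initialized output projection assumed here, `out` equals `frozen_only` before any training, so the bridge starts as a no-op on top of the frozen text-conditioned path, and `bridge_params` counts only the weights that would receive gradients. Whether the actual method uses such an initialization is not stated in the abstract.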