Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video
October 24, 2025
Authors: Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, Zach Evans
cs.AI
Abstract
Foley Control is a lightweight approach to video-guided Foley that keeps
pretrained single-modality models frozen and learns only a small
cross-attention bridge between them. We connect V-JEPA2 video embeddings to a
frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact
video cross-attention after the model's existing text cross-attention, so
prompts set global semantics while video refines timing and local dynamics. The
frozen backbones retain strong marginals (video; audio given text) and the
bridge learns the audio-video dependency needed for synchronization -- without
retraining the audio prior. To cut memory and stabilize training, we pool video
tokens before conditioning. On curated video-audio benchmarks, Foley Control
delivers competitive temporal and semantic alignment with far fewer trainable
parameters than recent multi-modal systems, while preserving prompt-driven
controllability and production-friendly modularity (swap/upgrade encoders or
the T2A backbone without end-to-end retraining). Although we focus on
Video-to-Foley, the same bridge design can potentially extend to other audio
modalities (e.g., speech).
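The bridge described above, a small trainable cross-attention layer that lets frozen audio-latent tokens attend to pooled video embeddings, can be sketched roughly as follows. This is a minimal illustration under assumed dimensions and module names (`VideoCrossAttentionBridge`, `pool_stride`, etc. are hypothetical, not from the paper), not the authors' implementation:

```python
import torch
import torch.nn as nn

class VideoCrossAttentionBridge(nn.Module):
    """Sketch of the trainable bridge: audio-latent tokens (from the frozen
    T2A DiT) cross-attend to pooled, projected video tokens. All dimensions
    are illustrative."""
    def __init__(self, audio_dim=256, video_dim=128, n_heads=4, pool_stride=4):
        super().__init__()
        # Pool video tokens along time before conditioning, to cut memory.
        self.pool = nn.AvgPool1d(pool_stride)
        self.video_proj = nn.Linear(video_dim, audio_dim)
        self.attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_tokens, video_tokens):
        # video_tokens: (B, T_v, video_dim) -> pooled to (B, T_v // stride, video_dim)
        v = self.pool(video_tokens.transpose(1, 2)).transpose(1, 2)
        v = self.video_proj(v)
        # Audio tokens query the video tokens; residual add, as in DiT blocks.
        out, _ = self.attn(self.norm(audio_tokens), v, v)
        return audio_tokens + out

# Only the bridge is trainable; the backbones stay frozen.
bridge = VideoCrossAttentionBridge()
audio = torch.randn(2, 64, 256)   # stand-in for frozen DiT latent tokens (B, T_a, D)
video = torch.randn(2, 32, 128)   # stand-in for V-JEPA2 video embeddings (B, T_v, D_v)
out = bridge(audio, video)
print(out.shape)  # torch.Size([2, 64, 256])
```

In the full model, a layer like this would be inserted after each existing text cross-attention layer, so the text prompt still sets global semantics while the video pathway only refines timing and local dynamics.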