Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

Abstract

Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting a compact video cross-attention layer after the model's existing text cross-attention, so that prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (over video, and over audio given text), and the bridge learns the audio-video dependency needed for synchronization without retraining the audio prior. To cut memory use and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (encoders or the T2A backbone can be swapped or upgraded without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design could extend to other audio modalities (e.g., speech).
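The bridge described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the Stable Audio Open or V-JEPA2 code: `ToyDiTBlock`, `pool_video_tokens`, `freeze_all_but_video_bridge`, the layer sizes, the pre-norm layout, and the choice of average pooling are all assumptions made for illustration. The only points taken from the text are the ordering (video cross-attention placed after text cross-attention), the pooling of video tokens before conditioning, and the fact that only the bridge is trained.

```python
import torch
import torch.nn as nn


def pool_video_tokens(video_tokens: torch.Tensor, factor: int) -> torch.Tensor:
    """Average-pool video tokens along the token axis.

    Assumption: the abstract only says tokens are pooled; average pooling
    with a fixed factor is one simple way to do it.
    """
    # avg_pool1d expects (batch, channels, length), so swap the last two axes.
    pooled = nn.functional.avg_pool1d(video_tokens.transpose(1, 2), kernel_size=factor)
    return pooled.transpose(1, 2)


class ToyDiTBlock(nn.Module):
    """Hypothetical stand-in for one transformer block of a T2A diffusion model."""

    def __init__(self, dim: int = 64, video_dim: int = 32, heads: int = 4):
        super().__init__()
        # Layers standing in for the pretrained (frozen) backbone.
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # The bridge: a new video cross-attention whose keys/values are video tokens.
        self.norm_v = nn.LayerNorm(dim)
        self.video_attn = nn.MultiheadAttention(
            dim, heads, kdim=video_dim, vdim=video_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Existing text cross-attention: the prompt sets global semantics.
        x = x + self.text_attn(self.norm2(x), text, text, need_weights=False)[0]
        # Inserted video cross-attention: video refines timing and local dynamics.
        x = x + self.video_attn(self.norm_v(x), video, video, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))


def freeze_all_but_video_bridge(block: nn.Module) -> None:
    """Leave only the bridge parameters trainable; everything else is frozen."""
    for name, param in block.named_parameters():
        param.requires_grad = name.startswith(("video_attn.", "norm_v."))


torch.manual_seed(0)
block = ToyDiTBlock(dim=64, video_dim=32)
freeze_all_but_video_bridge(block)

audio_latents = torch.randn(2, 50, 64)  # (batch, audio latent tokens, dim)
text_tokens = torch.randn(2, 8, 64)     # (batch, prompt tokens, dim)
video_tokens = torch.randn(2, 64, 32)   # (batch, video tokens, video_dim)

pooled = pool_video_tokens(video_tokens, factor=4)
out = block(audio_latents, text_tokens, pooled)

trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(tuple(out.shape), tuple(pooled.shape), f"{trainable}/{total} parameters trainable")
```

In this sketch an optimizer would be built only from parameters with `requires_grad=True`, so gradient updates touch the bridge alone. Swapping the video encoder then amounts to changing `video_dim` and retraining the bridge, which is the modularity the abstract points to.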