AVControl: Efficient Framework for Training Audio-Visual Controls
March 25, 2026
Authors: Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
cs.AI
Abstract
Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extensible framework built on LTX-2, a joint audio-visual foundation model. Each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that naively extending image-based in-context methods to video fails for structural control, and that our parallel-canvas approach resolves this failure. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially aligned controls such as depth, pose, and edges; camera trajectory with intrinsics; sparse motion control; video editing; and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
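To make the parallel-canvas idea concrete, the following is a minimal sketch, not the authors' implementation: control tokens from a separately encoded canvas are appended as extra key/value tokens inside an existing attention layer, while the backbone stays frozen and only small LoRA adapters receive gradients. All names and hyperparameters here (LoRALinear, ControlAttention, rank) are illustrative assumptions, not the AVControl or LTX-2 API.

```python
# Hypothetical sketch of a parallel-canvas attention layer with LoRA adapters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # initially behaves like the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))


class ControlAttention(nn.Module):
    """Attention over generation tokens, with canvas tokens appended as
    additional keys/values so they act as a reference signal only."""

    def __init__(self, dim: int, heads: int = 8, rank: int = 16):
        super().__init__()
        self.heads = heads
        self.q = LoRALinear(nn.Linear(dim, dim), rank)
        self.k = LoRALinear(nn.Linear(dim, dim), rank)
        self.v = LoRALinear(nn.Linear(dim, dim), rank)
        self.out = LoRALinear(nn.Linear(dim, dim), rank)

    def forward(self, tokens: torch.Tensor, canvas: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) latent video/audio tokens being denoised
        # canvas: (B, M, D) encoded control signal (e.g. depth or pose frames)
        kv_input = torch.cat([tokens, canvas], dim=1)  # queries see the canvas
        B, N, D = tokens.shape
        h, d = self.heads, D // self.heads
        q = self.q(tokens).view(B, N, h, d).transpose(1, 2)
        k = self.k(kv_input).view(B, -1, h, d).transpose(1, 2)
        v = self.v(kv_input).view(B, -1, h, d).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(B, N, D))
```

Under this sketch, a new control modality would only add its own LoRA weights on top of the shared frozen projections, which is consistent with the paper's claim that no architectural changes are needed beyond the adapters themselves.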