AVControl: Efficient Framework for Training Audio-Visual Controls
March 25, 2026
Authors: Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
cs.AI
Abstract
Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extensible framework built on LTX-2, a joint audio-visual foundation model. Each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that naively extending image-based in-context methods to video fails for structural control, and that our parallel-canvas approach resolves this failure. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially aligned controls such as depth, pose, and edges; camera trajectory with intrinsics; sparse motion control; video editing; and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
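To make the parallel-canvas idea concrete, the following is a minimal sketch, not the authors' implementation: control tokens from a separately encoded canvas are appended as extra key/value tokens inside an existing attention layer, while the backbone stays frozen and only small LoRA adapters receive gradients. All names and hyperparameters here (LoRALinear, ControlAttention, rank) are illustrative assumptions, not the AVControl or LTX-2 API.

```python
# Hypothetical sketch of a parallel-canvas attention layer with LoRA adapters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # backbone weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # initially behaves like the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))


class ControlAttention(nn.Module):
    """Attention over generation tokens, with canvas tokens appended as
    additional keys/values so they act as a reference signal only."""

    def __init__(self, dim: int, heads: int = 8, rank: int = 16):
        super().__init__()
        self.heads = heads
        self.q = LoRALinear(nn.Linear(dim, dim), rank)
        self.k = LoRALinear(nn.Linear(dim, dim), rank)
        self.v = LoRALinear(nn.Linear(dim, dim), rank)
        self.out = LoRALinear(nn.Linear(dim, dim), rank)

    def forward(self, tokens: torch.Tensor, canvas: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) latent video/audio tokens being denoised
        # canvas: (B, M, D) encoded control signal (e.g. depth or pose frames)
        kv_input = torch.cat([tokens, canvas], dim=1)  # queries see the canvas
        B, N, D = tokens.shape
        h, d = self.heads, D // self.heads
        q = self.q(tokens).view(B, N, h, d).transpose(1, 2)
        k = self.k(kv_input).view(B, -1, h, d).transpose(1, 2)
        v = self.v(kv_input).view(B, -1, h, d).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(B, N, D))
```

Under this sketch, a new control modality would only add its own LoRA weights on top of the shared frozen projections, which is consistent with the paper's claim that no architectural changes are needed beyond the adapters themselves.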