

AVControl: Efficient Framework for Training Audio-Visual Controls

March 25, 2026
作者: Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
cs.AI

Abstract

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
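The abstract's core mechanism is that each control modality is a separate LoRA, and the control signal enters as extra tokens on a "parallel canvas" attended to alongside the video tokens. The following is a minimal single-head sketch of that idea; the class and function names, dimensions, and zero-initialized up-projection are illustrative assumptions, not the paper's actual LTX-2 implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class LoRALinear:
    """Frozen base projection W plus a trainable low-rank update B @ A.
    Zero-initializing B makes the adapter a no-op at the start of training,
    so the base model's behavior is preserved until the LoRA learns."""
    def __init__(self, dim, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # frozen base
        self.A = rng.standard_normal((rank, dim)) / np.sqrt(dim)  # trainable down-proj
        self.B = np.zeros((dim, rank))                            # trainable up-proj

    def __call__(self, x):
        return x @ (self.W + self.B @ self.A).T

def parallel_canvas_attention(video_tokens, canvas_tokens, q, k, v):
    """Single-head self-attention over video tokens concatenated with
    control-canvas tokens (e.g. a tokenized depth or pose map). The video
    positions can attend to the canvas, injecting the reference signal
    without any architectural change beyond the LoRA projections."""
    x = np.concatenate([video_tokens, canvas_tokens], axis=0)  # (Tv + Tc, D)
    Q, K, V = q(x), k(x), v(x)
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    out = scores @ V
    return out[: video_tokens.shape[0]]  # keep only the video positions

# Hypothetical usage: 16 video tokens conditioned on 16 depth-canvas tokens.
D = 32
q, k, v = LoRALinear(D, seed=1), LoRALinear(D, seed=2), LoRALinear(D, seed=3)
rng = np.random.default_rng(0)
video = rng.standard_normal((16, D))
canvas = rng.standard_normal((16, D))
out = parallel_canvas_attention(video, canvas, q, k, v)
assert out.shape == (16, D)
```

Because only `A` and `B` receive gradients, swapping control modalities amounts to swapping small adapter checkpoints over a shared frozen backbone, which is what makes the per-modality training cheap.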