LooseControlVideo: Director-Style Video Control Using Spatial Blocking

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectories while relying on a video generative model to produce realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding of 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Compared with current state-of-the-art layout-conditioned models, we observe a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy, indicating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.