Boximator: Generating Rich and Controllable Motions for Video Synthesis
February 2, 2024
Authors: Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, Hang Li
cs.AI
Abstract
Generating rich and controllable motion is a pivotal challenge in video
synthesis. We propose Boximator, a new approach for fine-grained motion
control. Boximator introduces two constraint types: hard box and soft box.
Users select objects in the conditional frame with hard boxes and then use
either type of box to roughly or rigorously define the object's position,
shape, or motion path in future frames. Boximator functions as a plug-in for
existing video diffusion models. Its training process preserves the base
model's knowledge by freezing the original weights and training only the
control module. To address training challenges, we introduce a novel
self-tracking technique that greatly simplifies the learning of box-object
correlations. Empirically, Boximator achieves state-of-the-art video quality
(FVD) scores, improving on two base models and improving further once box
constraints are incorporated. Its robust motion controllability is validated
by drastic increases in the bounding box alignment metric. Human evaluation
also shows that users favor Boximator's generation results over those of the base model.
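The abstract describes the mechanism only at a high level. As a rough illustration of the two ideas it names, the sketch below shows one plausible way to represent hard/soft box constraints and to train only a small control module on top of a frozen base video diffusion model. All class names, fields, and dimensions here are hypothetical assumptions for illustration; they are not taken from the paper or any released Boximator code.

```python
# Hypothetical sketch (not the paper's implementation): representing
# Boximator-style box constraints and training only a control module
# while the base video diffusion model stays frozen.
from dataclasses import dataclass
from typing import List
import torch
import torch.nn as nn


@dataclass
class BoxConstraint:
    """A per-frame box constraint on one object (all names illustrative)."""
    frame_index: int
    object_id: int
    box: tuple          # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    hard: bool          # True = hard box (object must lie in this exact box);
                        # False = soft box (object should stay roughly inside)


class ControlModule(nn.Module):
    """Small trainable adapter that turns box constraints into control tokens."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        # Encode (box coordinates, hard/soft flag) into one token per constraint.
        self.box_encoder = nn.Sequential(
            nn.Linear(4 + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, constraints: List[BoxConstraint]) -> torch.Tensor:
        feats = torch.stack(
            [torch.tensor([*c.box, float(c.hard)]) for c in constraints]
        )
        return self.box_encoder(feats)  # shape: (num_constraints, hidden_dim)


def trainable_parameters(base_model: nn.Module, control: ControlModule):
    """Freeze the base model's weights; only the control module is trained."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    return list(control.parameters())
```

In this reading, the control tokens would be injected into the frozen base model's layers during denoising, and the optimizer would receive only `trainable_parameters(...)`, matching the abstract's statement that the base model's knowledge is preserved by freezing its original weights; how the paper actually injects the tokens or implements self-tracking is not specified here.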