

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

August 21, 2024
Authors: Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, Changhu Wang
cs.AI

Abstract

Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention maps of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores. The project page of TrackGo can be found at: https://zhtjtcz.github.io/TrackGo-Page/
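To make the adapter idea concrete, the sketch below shows one plausible way a lightweight, trainable adapter branch could be attached to a frozen temporal self-attention layer in PyTorch. This is a minimal illustration under stated assumptions, not the authors' TrackAdapter implementation; the class names, bottleneck size, and residual wiring are all hypothetical.

```python
# Hypothetical sketch: a small adapter branch added to a frozen temporal
# self-attention layer. Names and sizes are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSelfAttention(nn.Module):
    """Stand-in for a pretrained (to-be-frozen) temporal self-attention layer."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial, frames, dim) -- attention runs along the frame axis.
        out, _ = self.attn(x, x, x)
        return out


class LightweightAdapter(nn.Module):
    """Bottleneck adapter that produces a small learned residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as a zero residual (no-op at init)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x)))


class AdaptedTemporalAttention(nn.Module):
    """Frozen base attention plus a trainable adapter branch."""

    def __init__(self, base: TemporalSelfAttention, dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pretrained weights frozen
            p.requires_grad = False
        self.adapter = LightweightAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x)
        return h + self.adapter(h)  # adapter adds a learned correction


if __name__ == "__main__":
    dim = 320
    layer = AdaptedTemporalAttention(TemporalSelfAttention(dim), dim)
    tokens = torch.randn(4, 16, dim)  # (batch*spatial, frames, channels)
    print(layer(tokens).shape)        # torch.Size([4, 16, 320])
```

Only the adapter parameters are trainable here, which is the general appeal of adapter-based control: the pretrained video model stays intact while a small branch learns to steer the temporal attention.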