
TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

December 31, 2023
Authors: Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn
cs.AI

Abstract

Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, fine-tuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.