Dynamic Concepts Personalization from Single Videos
February 20, 2025
Authors: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
cs.AI
Abstract
Personalizing generative text-to-image models has seen remarkable progress,
but extending this personalization to text-to-video models presents unique
challenges. Unlike static concepts, personalizing text-to-video models has the
potential to capture dynamic concepts, i.e., entities defined not only by their
appearance but also by their motion. In this paper, we introduce
Set-and-Sequence, a novel framework for personalizing Diffusion Transformers
(DiTs)-based generative video models with dynamic concepts. Our approach
imposes a spatio-temporal weight space within an architecture that does not
explicitly separate spatial and temporal features. This is achieved in two key
stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an
unordered set of frames from the video to learn an identity LoRA basis that
represents the appearance, free from temporal interference. In the second
stage, with the identity LoRAs frozen, we augment their coefficients with
Motion Residuals and fine-tune them on the full video sequence, capturing
motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal
weight space that effectively embeds dynamic concepts into the video model's
output domain, enabling unprecedented editability and compositionality while
setting a new benchmark for personalizing dynamic concepts.
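To make the two-stage scheme concrete, below is a minimal PyTorch sketch of how a Set-and-Sequence-style LoRA might be wired onto a single linear layer of a DiT block. This is a sketch under stated assumptions, not the authors' implementation: the class name `SetAndSequenceLoRA`, the `set_stage` helper, and the per-rank `coeff`/`motion_residual` parameterization are all hypothetical. The abstract specifies only that an identity LoRA basis is learned on unordered frames in stage one, then frozen while motion residuals over its coefficients are fine-tuned on the full ordered sequence.

```python
import torch
import torch.nn as nn

class SetAndSequenceLoRA(nn.Module):
    """Illustrative two-stage LoRA wrapper for one linear layer of a DiT block.

    Stage 1 ("Set"): train the identity basis (A, B) and its coefficients
    on unordered frame sets, capturing appearance without temporal cues.
    Stage 2 ("Sequence"): freeze the basis and train only a motion
    residual over the coefficients on the ordered full video.
    """

    def __init__(self, base_layer: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained DiT weights stay frozen
        d_out, d_in = base_layer.weight.shape
        # Identity LoRA basis: down-projection A and up-projection B.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero-init: no-op at start
        # Per-rank coefficients scaling the basis directions.
        self.coeff = nn.Parameter(torch.ones(rank))
        # Motion residual added to the coefficients in stage 2 (assumed form).
        self.motion_residual = nn.Parameter(torch.zeros(rank))

    def set_stage(self, stage: int) -> None:
        train_identity = stage == 1
        self.A.requires_grad = train_identity
        self.B.requires_grad = train_identity
        self.coeff.requires_grad = train_identity
        self.motion_residual.requires_grad = not train_identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.coeff + self.motion_residual  # residual is zero in stage 1
        delta = ((x @ self.A.t()) * c) @ self.B.t()  # low-rank update
        return self.base(x) + delta
```

Training would then run two loops: stage one over randomly shuffled frame batches with `set_stage(1)`, and stage two over the ordered clip with `set_stage(2)`. The key design point the abstract describes is preserved in this sketch: freezing the identity basis in stage two confines motion learning to the coefficient space spanned by the appearance directions, which is what yields the disentangled spatio-temporal weight space.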