Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

August 1, 2024
Authors: Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
cs.AI

Abstract

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
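To make the described procedure concrete, below is a minimal sketch of what such an optimization loop could look like in PyTorch. The image-to-video model interface used here (`load_pretrained_image_to_video_model`, `encode_video`, `add_noise`, the UNet call signature, `sample`) is hypothetical, as are the frame and token counts; the sketch only illustrates the core idea that the model stays frozen while an inflated per-frame embedding, injected via cross-attention, is fitted to the motion reference video with the standard diffusion denoising objective.

```python
# Minimal sketch of motion-textual inversion. The model API is assumed, not
# the paper's published interface: only the learnable embedding is trained.
import torch

T, K, D = 16, 4, 1024  # frames, tokens per frame, embedding dim (illustrative)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Inflated motion-text embedding: K learnable tokens for each of the T frames,
# giving the high temporal motion granularity described in the abstract.
motion_embedding = torch.randn(T, K, D, device=device, requires_grad=True)
optimizer = torch.optim.AdamW([motion_embedding], lr=1e-3)

# Hypothetical frozen image-to-video diffusion model and data helpers.
i2v_model = load_pretrained_image_to_video_model().to(device).eval()
for p in i2v_model.parameters():
    p.requires_grad_(False)

ref_video = load_video("motion_reference.mp4")   # (T, C, H, W), assumed helper
ref_latents = i2v_model.encode_video(ref_video)  # per-frame VAE latents
first_frame_latent = ref_latents[:1]             # appearance comes from here

for step in range(1000):
    # Standard diffusion objective: noise the reference latents at a random
    # timestep and predict that noise, conditioning on the first frame
    # (appearance) and the learnable embedding (motion, via cross-attention;
    # the assumed interface routes each frame's K tokens to that frame).
    t = torch.randint(0, i2v_model.num_timesteps, (1,), device=device)
    noise = torch.randn_like(ref_latents)
    noisy = i2v_model.add_noise(ref_latents, noise, t)
    pred = i2v_model.unet(noisy, t,
                          image_cond=first_frame_latent,
                          encoder_hidden_states=motion_embedding)
    loss = torch.nn.functional.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: pair the optimized motion embedding with a *new* target image to
# generate a video with semantically similar motion; no spatial alignment
# between reference and target is required.
target_latent = i2v_model.encode_image(load_image("target.png"))
video = i2v_model.sample(image_cond=target_latent,
                         encoder_hidden_states=motion_embedding.detach())
```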
