
Target-Aware Video Diffusion Models

March 24, 2025
Authors: Taeksoo Kim, Hanbyul Joo
cs.AI

Abstract

We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
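To make the cross-attention loss concrete, below is a minimal PyTorch sketch, not the authors' code. It assumes access to the cross-attention weights of one semantically relevant transformer block and treats the image tokens as a single square spatial grid; the function name, its arguments, and the binary-cross-entropy formulation are illustrative assumptions, and the paper's exact loss, token handling, and block selection may differ.

```python
import torch
import torch.nn.functional as F

def target_token_attention_loss(attn, token_index, target_mask):
    """Align the cross-attention map of the special target token with
    the input target mask (hypothetical sketch, not the paper's loss).

    attn:        (B, heads, N_img, N_txt) cross-attention weights taken
                 from one semantically relevant transformer block
    token_index: position of the special target token in the text prompt
    target_mask: (B, H, W) binary segmentation mask of the target
    """
    b, _, n_img, _ = attn.shape

    # Attention each image token pays to the special token,
    # averaged over heads -> (B, N_img)
    token_attn = attn[..., token_index].mean(dim=1)

    # Fold the flattened image tokens back into a square spatial grid.
    # (Video models typically flatten spatio-temporal patches; a single
    # frame's spatial grid is assumed here for simplicity.)
    side = int(n_img ** 0.5)
    token_attn = token_attn.view(b, 1, side, side)

    # Normalize the attention map to [0, 1] per sample
    token_attn = token_attn / (token_attn.amax(dim=(2, 3), keepdim=True) + 1e-8)

    # Downsample the mask to the attention resolution
    mask = F.interpolate(target_mask.unsqueeze(1).float(),
                         size=(side, side), mode="nearest")

    # Push attention up inside the mask and down outside it
    return F.binary_cross_entropy(token_attn.clamp(0.0, 1.0), mask)
```

Consistent with the selective application described in the abstract, a loss of this form would be computed only on the transformer blocks (and attention regions) whose maps localize the target most reliably, rather than summed uniformly over every layer.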

