Precise Action-to-Video Generation Through Visual Action Prompts

Abstract

We present visual action prompts, a unified action representation for action-to-video generation of complex, high-DoF interactions that keeps visual dynamics transferable across domains. Action-driven video generation faces a precision-generality trade-off: existing methods that use text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts, which serve as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of the proposed approach. Project page: https://zju3dv.github.io/VAP/
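
As a rough illustration of what "rendering" an action into a visual skeleton could look like, the sketch below rasterizes 2D joint positions and bone connections into one image per frame. It is not the authors' pipeline: the function name `render_skeleton`, the three-joint chain, and the assumption that joints are already projected to pixel coordinates are all hypothetical choices made for this example.

```python
import numpy as np

def render_skeleton(keypoints, bones, height, width, samples_per_bone=64):
    """Rasterize 2D joints and bones into a single-channel uint8 image.

    keypoints: sequence of (x, y) pixel coordinates, one per joint (assumed
               to be already projected into the camera view).
    bones:     sequence of (start_joint, end_joint) index pairs.
    """
    canvas = np.zeros((height, width), dtype=np.uint8)
    keypoints = np.asarray(keypoints, dtype=np.float64)
    t = np.linspace(0.0, 1.0, samples_per_bone)[:, None]  # interpolation weights
    for start, end in bones:
        # Sample points along the segment between the two joints.
        points = (1.0 - t) * keypoints[start] + t * keypoints[end]
        cols = np.rint(points[:, 0]).astype(int)
        rows = np.rint(points[:, 1]).astype(int)
        # Drop samples that fall outside the frame.
        inside = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
        canvas[rows[inside], cols[inside]] = 255
    return canvas

# Hypothetical three-joint chain (for example, one finger) over two frames.
bones = [(0, 1), (1, 2)]
trajectory = [
    [(2, 2), (6, 2), (6, 6)],
    [(2, 3), (6, 3), (7, 7)],
]
prompt_frames = np.stack([render_skeleton(k, bones, 10, 10) for k in trajectory])
print(prompt_frames.shape)  # (2, 10, 10)
```

Because the output is an ordinary image sequence, the same representation can describe a human hand or a robot gripper, which is the domain-agnostic property the abstract emphasizes. How such frames are fed to the video model is not specified here and is left out of the sketch.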
