Precise Action-to-Video Generation Through Visual Action Prompts

August 18, 2025
Authors: Yuang Wang, Chao Wen, Haoyu Guo, Sida Peng, Minghan Qin, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
cs.AI

Abstract

We present visual action prompts, a unified action representation for action-to-video generation of complex, high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods based on text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts, a domain-agnostic representation that preserves both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of the proposed approach. Project page: https://zju3dv.github.io/VAP/.
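
The abstract describes rendering actions into visual skeleton frames and integrating them into a pretrained video generation model via lightweight fine-tuning. Below is a minimal, hypothetical sketch of one way such conditioning could be wired up: per-frame skeleton renderings are encoded by a small adapter whose output is added residually to the video backbone's latents, with the final projection zero-initialized so fine-tuning starts from the unmodified backbone (a ControlNet-style choice). All module and parameter names here (e.g. SkeletonPromptAdapter, latent_channels) are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code): conditioning a pretrained video
# generator on rendered visual-skeleton prompts via a lightweight adapter.
# Module and argument names are hypothetical.
import torch
import torch.nn as nn


class SkeletonPromptAdapter(nn.Module):
    """Encodes per-frame skeleton renderings into residual features that are
    added to the video backbone's latent input. The final projection is
    zero-initialized so training starts from the unmodified backbone."""

    def __init__(self, in_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )
        # Zero-init so the adapter initially contributes nothing (ControlNet-style).
        self.proj = nn.Conv2d(128, latent_channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, skeleton_frames: torch.Tensor) -> torch.Tensor:
        # skeleton_frames: (B, T, 3, H, W) rendered skeleton images.
        b, t, c, h, w = skeleton_frames.shape
        x = self.encoder(skeleton_frames.reshape(b * t, c, h, w))
        x = self.proj(x)
        # Returns (B, T, latent_channels, H/8, W/8) to match the video latents.
        return x.reshape(b, t, *x.shape[1:])


if __name__ == "__main__":
    adapter = SkeletonPromptAdapter()
    skeletons = torch.randn(2, 16, 3, 256, 256)    # rendered skeleton clips
    video_latents = torch.randn(2, 16, 4, 32, 32)  # latents from a frozen video VAE
    conditioned = video_latents + adapter(skeletons)  # residual action conditioning
    print(conditioned.shape)  # torch.Size([2, 16, 4, 32, 32])
```

In this sketch only the adapter (and optionally a small subset of backbone layers) would be updated during fine-tuning, which is consistent with the lightweight-fine-tuning strategy described in the abstract; the exact injection points and training recipe used in the paper are not specified here.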