Precise Action-to-Video Generation Through Visual Action Prompts

Abstract

We present visual action prompts, a unified action representation for action-to-video generation of complex, high-DoF interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods that use text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamics transferability, we propose to "render" actions into precise visual prompts, domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions. Specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1, and DROID demonstrate the effectiveness of the proposed approach. Project page: https://zju3dv.github.io/VAP/.
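
To make the idea of "rendering" an action into a visual skeleton concrete, the following is a minimal sketch, not the pipeline used in the paper. It assumes that 2D keypoints have already been estimated and projected into image coordinates, and it rasterizes them into one binary skeleton frame per time step. The names `BONES`, `render_skeleton`, and `render_sequence`, as well as the five-joint chain, are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical bone list for a 5-joint chain (an assumption for illustration);
# a real hand or gripper skeleton would define its own joints and connectivity.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def render_skeleton(keypoints, bones, height, width):
    """Rasterize 2D keypoints given as (x, y) and their bones into a binary mask."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    for a, b in bones:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the bone that no pixel gaps appear.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, n)).astype(int)
        ys = np.rint(np.linspace(y0, y1, n)).astype(int)
        # Drop samples that fall outside the image instead of raising.
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[inside], xs[inside]] = 1
    return canvas


def render_sequence(trajectory, bones, height, width):
    """Render one skeleton frame per time step: (T, J, 2) -> (T, H, W)."""
    return np.stack([render_skeleton(k, bones, height, width) for k in trajectory])
```

Because the output is an ordinary image sequence, the same representation can in principle describe a human hand or a robot gripper, which is the domain-agnostic property the abstract emphasizes. How such frames are fed into a pretrained video model is not specified here.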
