Precise Action-to-Video Generation Through Visual Action Prompts

Abstract

We present visual action prompts, a unified action representation for action-to-video generation of complex, high-DoF (degree-of-freedom) interactions that maintains transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods that use text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts, domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources, human-object interactions (HOI) and dexterous robotic manipulation, enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interactions while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
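
As a rough illustration of what "rendering" an action into a visual skeleton could look like, the sketch below rasterizes a trajectory of 2D joint positions into one skeleton mask per frame using NumPy. The joint layout `BONES`, the function names, and the single-channel mask format are assumptions made for illustration only; the abstract does not specify the actual rendering pipeline.

```python
import numpy as np

# Hypothetical 5-joint chain; real hand or gripper skeletons have their own topology.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def render_skeleton(joints_2d, bones, height, width):
    """Rasterize 2D joints (pixel x, y) and their bones into an HxW uint8 mask."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    for a, b in bones:
        p, q = joints_2d[a], joints_2d[b]
        # Sample at most one pixel apart along the longer axis so the line has no gaps.
        n = int(np.ceil(np.abs(q - p).max())) + 1
        t = np.linspace(0.0, 1.0, n)[:, None]
        pts = np.rint(p + t * (q - p)).astype(int)
        # Drop samples that fall outside the image.
        ok = (
            (pts[:, 0] >= 0) & (pts[:, 0] < width)
            & (pts[:, 1] >= 0) & (pts[:, 1] < height)
        )
        canvas[pts[ok, 1], pts[ok, 0]] = 255
    return canvas


def render_action_prompt(joint_track, bones, height, width):
    """Turn a (T, J, 2) joint trajectory into a (T, H, W) skeleton video."""
    return np.stack([render_skeleton(j, bones, height, width) for j in joint_track])
```

The resulting `(T, H, W)` array is the kind of pixel-space signal that could condition a video model in the same way whether the joints were estimated from a human hand or projected from a robot's kinematic state, which is the intuition behind a domain-agnostic action representation.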
