Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Abstract

Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts, which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion priors of the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty of obtaining high-quality paired force-video training data, both in the real world, where force signals are hard to obtain, and in synthetic data, where physics simulators are limited in visual quality and domain diversity. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of a few objects. Our method can generate videos that simulate forces across diverse geometries, settings, and materials. We also investigate the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos on our project page.

