ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Abstract

Video generation has advanced significantly in recent years, yet generating complex motions and interactions remains challenging. To address this, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motions and interactions. ReVision consists of three stages. First, a video diffusion model generates a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which our proposed parameterized physical prior model then refines into an accurate 3D motion sequence. Finally, the refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling motion-consistent video generation even in scenarios involving complex actions and interactions. We validate our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
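
The three-stage control flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the class `ReVisionSketch`, its three callables, the `condition` argument, and the toy stand-ins are all hypothetical names introduced here. In particular, the moving-average `toy_refine` merely stands in for the parameterized physical prior model, whose actual form the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = List[float]   # stand-in for an image; a real frame would be a tensor
Motion = List[float]  # stand-in for a per-frame 3D object-centric state


@dataclass
class ReVisionSketch:
    """Hypothetical wrapper showing only the order of the three stages."""
    # All three callables are assumptions for illustration, not a real API.
    generate: Callable[[str, Optional[Sequence[Motion]]], List[Frame]]
    extract_motion: Callable[[List[Frame]], List[Motion]]
    refine_motion: Callable[[List[Motion]], List[Motion]]

    def run(self, condition: str) -> List[Frame]:
        coarse = self.generate(condition, None)    # stage 1: coarse video
        motion = self.extract_motion(coarse)       # stage 2a: 3D representation
        refined = self.refine_motion(motion)       # stage 2b: prior-based refinement
        return self.generate(condition, refined)   # stage 3: same model, extra conditioning


def toy_generate(condition: str,
                 motion: Optional[Sequence[Motion]] = None) -> List[Frame]:
    # Toy "video model": jittery output when unconditioned,
    # otherwise frames that follow the supplied motion exactly.
    if motion is None:
        return [[0.0], [3.0], [1.0], [4.0]]
    return [[m[0]] for m in motion]


def toy_extract(frames: List[Frame]) -> List[Motion]:
    # Toy feature extraction: read one coordinate per frame.
    return [[f[0]] for f in frames]


def toy_refine(motion: List[Motion]) -> List[Motion]:
    # Placeholder for the physical prior: a 3-tap moving average
    # whose window is truncated at the sequence boundaries.
    n = len(motion)
    refined = []
    for i in range(n):
        window = motion[max(0, i - 1):min(n, i + 2)]
        refined.append([sum(w[0] for w in window) / len(window)])
    return refined


pipeline = ReVisionSketch(toy_generate, toy_extract, toy_refine)
final_frames = pipeline.run("a person throws a ball")
```

The point of the sketch is that the same generator is called twice, once without and once with the refined motion as conditioning, which is what makes the framework plug-and-play around an existing pretrained model.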