ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Abstract

Human video generation remains challenging because human appearance, motion, and camera viewpoint must be modeled jointly under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
