ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
