ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
April 21, 2026
Authors: Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
cs.AI
Abstract
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
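The two-stage design described above (pose- and viewpoint-conditioned per-frame image synthesis, followed by training-free temporal refinement) can be sketched as follows. Every name in this sketch is hypothetical and does not match the released code; the moving-average step merely stands in for the pretrained video-diffusion refinement pass:

```python
# Hypothetical sketch of the image-first, two-stage pipeline; function names
# and the scalar "frame" representation are illustrative only.

def synthesize_frame(appearance_prior, pose, viewpoint):
    # Stage 1 stand-in: pose/viewpoint-conditioned image generation.
    # A "frame" is reduced to a scalar derived from its conditions so the
    # temporal stage below has something numeric to smooth.
    return appearance_prior + 0.1 * pose + 0.01 * viewpoint

def temporal_refine(frames, window=3):
    # Stage 2 stand-in: a moving average over the frame axis, mimicking how
    # a pretrained video model enforces temporal consistency without any
    # additional training.
    half = window // 2
    smoothed = []
    for i in range(len(frames)):
        lo, hi = max(0, i - half), min(len(frames), i + half + 1)
        smoothed.append(sum(frames[lo:hi]) / (hi - lo))
    return smoothed

# Per-frame SMPL-X-style pose conditions (here just scalars) for a short clip.
poses = [0, 5, 1, 6, 2]
frames = [synthesize_frame(1.0, p, viewpoint=30) for p in poses]
video = temporal_refine(frames)
assert len(video) == len(frames)
```

The key design choice the sketch illustrates is the decoupling: appearance quality is fixed entirely in stage 1, so stage 2 only has to reconcile frames with each other, not learn appearance from scarce multi-view video data.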