ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
April 21, 2026
Authors: Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
cs.AI
Abstract
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
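The abstract describes a two-stage design: per-frame image synthesis under SMPL-X pose and viewpoint control, followed by training-free temporal refinement with a pretrained video diffusion model. The skeleton below is a minimal sketch of that structure only; every function name is a hypothetical stand-in (the actual interfaces live in the linked repository), and the refinement stage is assumed to work in an SDEdit-like fashion (partially noising the frame stack, then denoising with the video model's temporal prior), which the abstract does not confirm.

    import torch

    # --- Hypothetical stubs standing in for the paper's components. ---

    def render_smplx_condition(pose, camera):
        """Render an SMPL-X pose from the given camera as a conditioning
        map (e.g., a rasterized mesh or keypoint image). Stub: blank map."""
        return torch.zeros(3, 512, 512)

    def image_backbone(appearance_prior, condition, seed=0):
        """Pose-/viewpoint-controllable generator built on a pretrained
        image backbone. Stub: returns a blank frame."""
        torch.manual_seed(seed)  # a fixed seed keeps appearance stable across frames
        return torch.zeros(3, 512, 512)

    def video_diffusion_refine(frames, strength=0.4):
        """Training-free temporal refinement: assumed here to partially
        noise the frame stack and denoise it with a pretrained video
        diffusion model so its temporal prior suppresses flicker.
        Stub: identity."""
        return frames

    def generate_video(appearance_prior, motion, camera_path):
        # Stage 1 (image-first): synthesize each frame independently,
        # with appearance learned via image generation and motion/viewpoint
        # supplied by SMPL-X-based conditioning.
        frames = torch.stack([
            image_backbone(appearance_prior, render_smplx_condition(pose, cam))
            for pose, cam in zip(motion, camera_path)
        ])
        # Stage 2: enforce temporal consistency without any extra training.
        return video_diffusion_refine(frames)

The point of the split is that appearance quality is set entirely in stage 1, where abundant image data applies, while stage 2 only has to restore temporal coherence, so no video-level retraining of the appearance model is needed.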