Controllable Human-centric Keyframe Interpolation with Generative Prior
June 3, 2025
Authors: Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
cs.AI
Abstract
Existing interpolation methods use pre-trained video diffusion priors to
generate intermediate frames between sparsely sampled keyframes. In the absence
of 3D geometric guidance, these methods struggle to produce plausible results
for complex, articulated human motions and offer limited control over the
synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe
Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human
guidance signals into the diffusion process for Controllable Human-centric
Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for
interpolation, our PoseFuse3D, a 3D-informed control model, features a novel
SMPL-X encoder that transforms 3D geometry and shape into the 2D latent
conditioning space, alongside a fusion network that integrates these 3D cues
with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset
annotated with both 2D poses and 3D SMPL-X parameters. We show that
PoseFuse3D-KI consistently outperforms state-of-the-art baselines on
CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS.
Comprehensive ablations demonstrate that our PoseFuse3D model improves
interpolation fidelity.
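
For readers unfamiliar with how such figures are typically derived, the following is a minimal sketch of a PSNR computation and of a relative percentage change between a baseline and a new method. It assumes the reported 9% and 38% are relative changes against a baseline score, which the abstract does not state explicitly. The helper names `psnr` and `relative_change`, the toy pixel values, and the baseline/method scores are all hypothetical and are not taken from the paper.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, ours):
    """Signed change of `ours` relative to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Toy 4-pixel "frames" (hypothetical values, for illustration only).
reference = [10, 20, 30, 40]
estimate = [12, 18, 33, 39]
toy_psnr = psnr(reference, estimate)

# Hypothetical scores chosen only to show the arithmetic, not results from the paper.
psnr_gain = relative_change(baseline=20.0, ours=21.8)    # higher PSNR is better
lpips_change = relative_change(baseline=0.50, ours=0.31)  # lower LPIPS is better

print(f"toy PSNR: {toy_psnr:.2f} dB")
print(f"PSNR change: {psnr_gain:+.1f}%, LPIPS change: {lpips_change:+.1f}%")
```

Under this reading, a positive change in PSNR and a negative change in LPIPS both indicate improvement, since PSNR is a higher-is-better fidelity score while LPIPS is a lower-is-better perceptual distance.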