Controllable Human-centric Keyframe Interpolation with Generative Prior
June 3, 2025
Authors: Zujin Guo, Size Wu, Zhongang Cai, Wei Li, Chen Change Loy
cs.AI
Abstract
Existing interpolation methods use pre-trained video diffusion priors to
generate intermediate frames between sparsely sampled keyframes. In the absence
of 3D geometric guidance, these methods struggle to produce plausible results
for complex, articulated human motions and offer limited control over the
synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe
Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human
guidance signals into the diffusion process for Controllable Human-centric
Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for
interpolation, our PoseFuse3D, a 3D-informed control model, features a novel
SMPL-X encoder that transforms 3D geometry and shape into the 2D latent
conditioning space, alongside a fusion network that integrates these 3D cues
with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset
annotated with both 2D poses and 3D SMPL-X parameters. We show that
PoseFuse3D-KI consistently outperforms state-of-the-art baselines on
CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS.
Comprehensive ablations demonstrate that our PoseFuse3D model improves
interpolation fidelity.
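For readers unfamiliar with the PSNR figure quoted in the abstract, the metric can be sketched as follows. This is a minimal pure-Python illustration, not the authors' evaluation code: it assumes 8-bit pixel values (peak 255) supplied as flat sequences, and the helper names `mse`, `psnr`, and `relative_gain` are hypothetical. Treating the reported 9% as a relative gain over a baseline score is also an assumption, since the abstract does not say how the percentage is computed.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / err)


def relative_gain(baseline, improved):
    """Fractional change of a score relative to a baseline (assumed convention)."""
    return (improved - baseline) / baseline
```

Under this convention, a baseline PSNR of 20.0 dB and an improved PSNR of 21.8 dB would correspond to a relative gain of about 9%; these numbers are made up for illustration and are not results from the paper.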