Controllable Human-centric Keyframe Interpolation with Generative Prior

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. Without 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motion and offer limited control over the synthesized dynamics. In this paper, we introduce the PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, PoseFuse3D, our 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.
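For readers unfamiliar with the reported metric, the following is a minimal sketch of how PSNR is conventionally computed and how a percentage gain over a baseline could be read. It assumes the reported percentages are relative changes against the baseline score, which the abstract does not state explicitly. The helper names `psnr` and `relative_change` and all numeric values are hypothetical illustrations, not the paper's evaluation code or results.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Standard PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, ours):
    """Signed relative change of `ours` with respect to `baseline`."""
    return (ours - baseline) / baseline


# Hypothetical pixel values, chosen only to exercise the formula.
ground_truth = [100, 100, 100, 100]
interpolated = [110, 90, 110, 90]
score = psnr(ground_truth, interpolated)

# Hypothetical scores (NOT taken from the paper): a baseline at 20.0 dB
# and a method at 21.8 dB would correspond to a 9% relative PSNR gain.
gain = relative_change(20.0, 21.8)
```

Under this reading, a higher PSNR and a lower LPIPS both indicate interpolated frames closer to the ground truth, so the two reported changes point in the same direction.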
