Controllable Human-centric Keyframe Interpolation with Generative Prior

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. Without 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motion and offer limited control over the synthesized dynamics. In this paper, we introduce the PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, PoseFuse3D, our 3D-informed control model, features a novel SMPL-X encoder that maps 3D geometry and shape into the 2D latent conditioning space, together with a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that the PoseFuse3D model improves interpolation fidelity.
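For readers less familiar with the reported metric, the following is a minimal sketch of how PSNR and a percentage change against a baseline are commonly computed. It is not the evaluation code behind the numbers above. The function names `psnr` and `relative_change` are hypothetical, pixel values are assumed to lie in [0, 1], and the reported percentages are assumed to be relative changes with respect to the baseline score, which the abstract does not state explicitly.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumes flattened frames with values in [0, peak] (an assumption of this sketch).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)


def relative_change(baseline: float, ours: float) -> float:
    """Signed relative change of `ours` with respect to `baseline`.

    Positive means an increase (better for PSNR, where higher is better);
    negative means a decrease (better for LPIPS, where lower is better).
    """
    if baseline == 0.0:
        raise ValueError("baseline must be non-zero")
    return (ours - baseline) / baseline
```

Under this reading, `relative_change(baseline_psnr, our_psnr)` of about 0.09 would correspond to a 9% PSNR improvement, and `relative_change(baseline_lpips, our_lpips)` of about -0.38 to a 38% LPIPS reduction. LPIPS itself relies on a learned network and is not reproduced here.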