Controllable Human-centric Keyframe Interpolation with Generative Prior

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. In the absence of 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, PoseFuse3D, our 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, alongside a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. We show that PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that our PoseFuse3D model improves interpolation fidelity.