Controllable Human-centric Keyframe Interpolation with Generative Prior

Abstract

Existing interpolation methods use pre-trained video diffusion priors to generate intermediate frames between sparsely sampled keyframes. Without 3D geometric guidance, these methods struggle to produce plausible results for complex, articulated human motions and offer limited control over the synthesized dynamics. In this paper, we introduce PoseFuse3D Keyframe Interpolator (PoseFuse3D-KI), a novel framework that integrates 3D human guidance signals into the diffusion process for Controllable Human-centric Keyframe Interpolation (CHKI). To provide rich spatial and structural cues for interpolation, PoseFuse3D, our 3D-informed control model, features a novel SMPL-X encoder that transforms 3D geometry and shape into the 2D latent conditioning space, together with a fusion network that integrates these 3D cues with 2D pose embeddings. For evaluation, we build CHKI-Video, a new dataset annotated with both 2D poses and 3D SMPL-X parameters. PoseFuse3D-KI consistently outperforms state-of-the-art baselines on CHKI-Video, achieving a 9% improvement in PSNR and a 38% reduction in LPIPS. Comprehensive ablations demonstrate that the PoseFuse3D model improves interpolation fidelity.
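For readers unfamiliar with the reported metrics, the following is a minimal sketch of how PSNR is conventionally computed between a reference frame and a generated frame, and how a figure such as "9% improvement" can be read as a relative change against a baseline score. It assumes 8-bit pixel values flattened into plain Python lists, and it assumes the percentages are relative changes, which the abstract does not state explicitly. The function names `psnr` and `relative_change` and the toy pixel values are illustrative and are not taken from the paper. LPIPS depends on a learned network and is not shown here.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, ours):
    """Signed relative change of `ours` against `baseline`, as a fraction."""
    return (ours - baseline) / baseline


# Hypothetical toy frames (not data from the paper).
ref = [0, 64, 128, 255]
gen = [0, 60, 130, 250]
score = psnr(ref, gen)
```

Higher PSNR indicates a closer match to the reference, whereas LPIPS is a distance, so a reduction is the improvement in that case.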
