Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Abstract

We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods, which rely on appearance-based semantic cues, SP4D learns to produce kinematic parts: structural components that align with object articulation and stay consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and the corresponding part segmentation maps. To simplify the architecture and support a variable number of parts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding lets the segmentation branch share the latent VAE of the RGB branch, while part segmentations can be recovered through straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss that promotes spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
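The spatial color encoding idea can be illustrated with a small sketch. The abstract does not specify how colors are assigned or how masks are recovered, so everything below is an assumption made for illustration: each part is colored by the normalized 2D centroid of its mask, and decoding groups pixels by color similarity. The function names `encode_parts` and `decode_parts` and the tolerance `tol` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def encode_parts(labels):
    """Map an integer part-label grid to an RGB-like grid of floats in [0, 1].

    Assumption (not from the paper): a part's color is its normalized
    centroid (x, y) plus a constant third channel of 1.0. Label 0 is
    treated as background and maps to black.
    """
    h, w = len(labels), len(labels[0])
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # label -> [sum_x, sum_y, count]
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab != 0:
                acc = sums[lab]
                acc[0] += x
                acc[1] += y
                acc[2] += 1
    colors = {0: (0.0, 0.0, 0.0)}
    for lab, (sx, sy, n) in sums.items():
        colors[lab] = (sx / n / max(w - 1, 1), sy / n / max(h - 1, 1), 1.0)
    return [[colors[lab] for lab in row] for row in labels]


def decode_parts(image, tol=0.05):
    """Recover a label grid by greedily grouping pixels with similar colors.

    Part ids are assigned in order of first appearance (row-major scan),
    so they match the input labels only up to a relabeling in general.
    """
    centers = []  # representative color of each recovered part
    out = []
    for row in image:
        out_row = []
        for px in row:
            if max(px) < tol:  # near-black pixel: background
                out_row.append(0)
                continue
            for i, c in enumerate(centers):
                if max(abs(a - b) for a, b in zip(px, c)) < tol:
                    out_row.append(i + 1)
                    break
            else:  # no existing center is close enough: start a new part
                centers.append(px)
                out_row.append(len(centers))
        out.append(out_row)
    return out


labels = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [2, 2, 0, 3],
    [2, 2, 0, 3],
]
recovered = decode_parts(encode_parts(labels))
```

Because colors live in a continuous space, the same encoder handles any number of parts without a fixed-size label head, which is the flexibility the abstract points to. This toy version breaks when two parts share a centroid (for example, concentric parts), so a practical scheme would need a more discriminative color assignment.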
