Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation
September 12, 2025
Authors: Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani
cs.AI
Abstract
We present Stable Part Diffusion 4D (SP4D), a framework for generating paired
RGB and kinematic part videos from monocular inputs. Unlike conventional part
segmentation methods that rely on appearance-based semantic cues, SP4D learns
to produce kinematic parts: structural components aligned with object
articulation and consistent across views and time. SP4D adopts a dual-branch
diffusion model that jointly synthesizes RGB frames and corresponding part
segmentation maps. To simplify the architecture and flexibly support different
part counts, we introduce a spatial color encoding scheme that maps part masks
to continuous RGB-like images. This encoding allows the segmentation branch to
share the latent VAE from the RGB branch, while enabling part segmentation to
be recovered via straightforward post-processing. A Bidirectional Diffusion
Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a
contrastive part consistency loss to promote spatial and temporal alignment of
part predictions. We demonstrate that the generated 2D part maps can be lifted
to 3D to derive skeletal structures and harmonic skinning weights with few
manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K,
a curated dataset of over 20K rigged objects selected and processed from
Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part
video sequences. Experiments show that SP4D generalizes strongly to diverse
scenarios, including real-world videos, novel generated objects, and rare
articulated poses, producing kinematic-aware outputs suitable for downstream
animation and motion-related tasks.
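
The spatial color encoding is only described at a high level above. The sketch below is a minimal illustration, assuming the scheme assigns each part ID a fixed, well-separated RGB color so that masks become continuous RGB-like images, with parts recovered afterwards by nearest-color assignment; the paper's actual palette and post-processing are not given here and may differ.

import colorsys
import numpy as np

def make_palette(num_parts: int) -> np.ndarray:
    # Evenly spaced, fully saturated hues give well-separated colors in [0, 1].
    return np.array([colorsys.hsv_to_rgb(i / num_parts, 1.0, 1.0)
                     for i in range(num_parts)])

def encode_parts(part_ids: np.ndarray, palette: np.ndarray) -> np.ndarray:
    # Map an (H, W) integer part-ID map to an (H, W, 3) RGB-like image
    # that a shared RGB VAE can consume like any ordinary frame.
    return palette[part_ids]

def decode_parts(rgb: np.ndarray, palette: np.ndarray) -> np.ndarray:
    # Recover part IDs by nearest-color assignment: the "straightforward
    # post-processing" step. dists has shape (H, W, K) over K palette colors.
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None], axis=-1)
    return dists.argmin(axis=-1)

palette = make_palette(8)
ids = np.random.default_rng(0).integers(0, 8, size=(64, 64))
noisy = encode_parts(ids, palette) \
        + 0.02 * np.random.default_rng(1).standard_normal((64, 64, 3))
print("recovery accuracy under mild noise:", (decode_parts(noisy, palette) == ids).mean())

Because the encoded maps live in the same RGB space as the rendered frames, the segmentation branch can reuse the RGB branch's VAE unchanged, and varying the number of parts only changes the palette, not the network.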
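The contrastive part consistency loss is likewise named but not defined in the abstract. A plausible InfoNCE-style sketch is shown below, in which features sampled across views and frames are pulled together when they share a part label and pushed apart otherwise; the function name, shapes, and formulation here are assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def part_consistency_loss(feats: torch.Tensor, part_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    # feats: (N, D) features sampled across views/frames;
    # part_ids: (N,) integer part label per feature.
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature                 # (N, N) cosine logits
    self_mask = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (part_ids[:, None] == part_ids[None, :]) & ~self_mask
    log_prob = F.log_softmax(sim.masked_fill(self_mask, float("-inf")), dim=-1)
    return -log_prob[pos].mean()                          # average over same-part pairs

feats = torch.randn(256, 64)
part_ids = torch.randint(0, 8, (256,))
print(part_consistency_loss(feats, part_ids))

Minimizing a loss of this kind encourages the part branch to assign the same encoding to a given part wherever and whenever it appears, which is the spatial and temporal alignment the abstract refers to.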
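For the 3D lifting step, the abstract mentions deriving harmonic skinning weights. As a sketch of the standard harmonic-weight construction (an assumption; the paper's exact lifting pipeline is not detailed above), each part's weight field solves a graph Laplace equation over the mesh with Dirichlet constraints at vertices labeled by the lifted part maps:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def harmonic_weights(edges: np.ndarray, labels: np.ndarray, num_parts: int) -> np.ndarray:
    # edges: (E, 2) mesh edges; labels: (V,) part ID per vertex, -1 = unconstrained.
    V = len(labels)
    rows = np.concatenate([edges[:, 0], edges[:, 1]])
    cols = np.concatenate([edges[:, 1], edges[:, 0]])
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(V, V)).tocsr()
    lap = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj   # graph Laplacian
    free = labels < 0
    W = np.zeros((V, num_parts))
    for k in range(num_parts):
        boundary = (labels == k).astype(float)   # 1 on part k's vertices, 0 on other parts
        rhs = -lap[free][:, ~free] @ boundary[~free]
        W[free, k] = spsolve(lap[free][:, free].tocsc(), rhs)   # L_ff w_f = -L_fc w_c
        W[~free, k] = boundary[~free]
    return W / W.sum(axis=1, keepdims=True)      # weights form a partition of unity

A skeleton fit to the lifted part regions can then use these smooth, per-part weights to bind the surface to its bones, which is what makes the outputs directly usable for the downstream animation tasks mentioned above.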