ChatPaper.ai


Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

September 12, 2025
Authors: Hao Zhang, Chun-Han Yao, Simon Donné, Narendra Ahuja, Varun Jampani
cs.AI

Abstract

We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts - structural components aligned with object articulation and consistent across views and time. SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly enable different part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE from the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing. A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments. To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse XL (Deitke et al., 2023), each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes strongly to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.
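The spatial color encoding described above can be illustrated with a minimal sketch: each part index is assigned a fixed RGB color so that a discrete part-id map becomes a continuous RGB-like image (and can therefore reuse the RGB branch's latent VAE), while the discrete labels are recovered afterwards by nearest-color matching. The palette construction and function names here are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def make_palette(num_parts, seed=0):
    """Fixed RGB colors in [0, 1], one per part (hypothetical palette choice)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(num_parts, 3))

def encode_parts(part_ids, palette):
    """Map an (H, W) integer part-id map to an (H, W, 3) RGB-like image."""
    return palette[part_ids]

def decode_parts(rgb, palette):
    """Recover part ids via nearest palette color (the post-processing step)."""
    # Distance from every pixel to every palette color: shape (H, W, P).
    dists = np.linalg.norm(rgb[..., None, :] - palette[None, None], axis=-1)
    return dists.argmin(axis=-1)

palette = make_palette(num_parts=8)
ids = np.random.default_rng(1).integers(0, 8, size=(16, 16))
rgb = encode_parts(ids, palette)          # continuous image, VAE-compatible
recovered = decode_parts(rgb, palette)    # exact round trip for clean colors
assert np.array_equal(recovered, ids)
```

Because decoding is a nearest-neighbor lookup, the scheme also tolerates small color perturbations in the generated frames, which is what makes recovering part masks from a diffusion model's continuous output practical.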
PDF · September 17, 2025