

Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

December 29, 2025
Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
cs.AI

Abstract

Transparent objects remain notoriously hard for perception systems: refraction, reflection, and transmission break the assumptions behind stereo, time-of-flight (ToF), and purely discriminative monocular depth estimation, causing holes and temporally unstable predictions. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the underlying optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets, paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot state-of-the-art results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image and video baselines, and a normal-estimation variant sets the best video normal estimation results on ClearPose. A compact 1.3B-parameter version runs at roughly 0.17 s per frame. Integrated into a grasping stack, DKT's depth predictions boost success rates across translucent, reflective, and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and without labels, into robust, temporally coherent perception for challenging real-world manipulation.
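The abstract's core mechanism (noising the depth latents, concatenating them channel-wise with the RGB latents, and fine-tuning a DiT-style video backbone through LoRA adapters) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: all tensor shapes, module names (`LoRALinear`, `build_dit_input`), and the noise schedule are assumptions made for exposition.

```python
# Minimal sketch (not the paper's code) of the two ideas named in the
# abstract: (1) LoRA adapters on top of frozen pretrained weights, and
# (2) concatenating RGB latents with noised depth latents before a
# DiT-style video backbone. Names and shapes are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)      # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def build_dit_input(rgb_latent, depth_latent, noise_level):
    """Concatenate clean RGB latents with noised depth latents.

    rgb_latent, depth_latent: (B, T, C, H, W) video latents from a VAE.
    noise_level: scalar in [0, 1]; the blending schedule here is an
    assumption, not the paper's diffusion schedule.
    """
    noise = torch.randn_like(depth_latent)
    noisy_depth = (1.0 - noise_level) * depth_latent + noise_level * noise
    # Channel-wise concatenation; a DiT patch-embedding layer would be
    # widened to accept 2*C input channels.
    return torch.cat([rgb_latent, noisy_depth], dim=2)

# Usage: wrap a (pretend) attention projection with LoRA, then build the
# concatenated latent input; only LoRA parameters would be optimized.
attn_proj = LoRALinear(nn.Linear(512, 512), rank=16)
rgb = torch.randn(1, 8, 4, 32, 32)
depth = torch.randn(1, 8, 4, 32, 32)
x = build_dit_input(rgb, depth, noise_level=0.7)
print(x.shape)  # torch.Size([1, 8, 8, 32, 32])
```

The design point this sketch mirrors is that the pretrained video prior stays frozen: only the low-rank adapters (and any widened input projection) are trained, which is what makes the adaptation lightweight and label-efficient.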