Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
December 29, 2025
Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao
cs.AI
Abstract
Transparent objects remain notoriously hard for perception systems: refraction, reflection, and transmission break the assumptions behind stereo, ToF, and purely discriminative monocular depth estimation, causing holes and temporally unstable predictions. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the underlying optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes comprising 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets, paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate the RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for input videos of arbitrary length. The resulting model, DKT, achieves zero-shot state-of-the-art results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal-estimation variant sets the best video normal estimation results on ClearPose. A compact 1.3B-parameter version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth predictions boost success rates across translucent, reflective, and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and without labels, into robust, temporally coherent perception for challenging real-world manipulation.
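To make the training-time conditioning concrete, below is a minimal, illustrative PyTorch sketch of concatenating RGB latents with noisy depth latents along the channel axis before a LoRA-adapted transformer block. All shapes, module names (LoRALinear, ToyDiTBlock), the noising step, and hyperparameters are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code): channel-wise concatenation of
# RGB and noisy depth latents in front of a LoRA-adapted transformer block.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank (LoRA) update."""

    def __init__(self, dim_in, dim_out, rank=16, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(dim_in, rank, bias=False)  # trainable A
        self.up = nn.Linear(rank, dim_out, bias=False)   # trainable B
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class ToyDiTBlock(nn.Module):
    """One transformer block standing in for the video DiT backbone."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(LoRALinear(dim, 4 * dim), nn.GELU(),
                                 LoRALinear(4 * dim, dim))

    def forward(self, tokens):
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return tokens + self.mlp(self.norm2(tokens))


# Assumed video-VAE latent shape: (batch, frames, channels, height, width).
B, T, C, H, W = 1, 8, 4, 16, 16
rgb_latent = torch.randn(B, T, C, H, W)    # clean RGB video latents
depth_latent = torch.randn(B, T, C, H, W)  # ground-truth depth latents
t = 0.5                                    # one diffusion timestep, for illustration
noisy_depth = (1 - t) * depth_latent + t * torch.randn_like(depth_latent)

# Concatenate RGB and noisy depth latents along the channel axis.
x = torch.cat([rgb_latent, noisy_depth], dim=2)  # (B, T, 2C, H, W)

# Turn the spatio-temporal grid into tokens and run the LoRA-adapted block.
dim = 512
to_tokens = nn.Linear(2 * C, dim)
tokens = to_tokens(x.permute(0, 1, 3, 4, 2).reshape(B, T * H * W, 2 * C))
out = ToyDiTBlock(dim)(tokens)
print(out.shape)  # torch.Size([1, 2048, 512])
```

In the actual DKT pipeline the backbone, video VAE, noise schedule, and conditioning details come from the pretrained video diffusion model and will differ from this toy setup; the sketch only illustrates the latent-concatenation idea described in the abstract.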