World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
June 11, 2026
Authors: Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang
cs.AI
Abstract
Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
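To make the layered representation concrete, the following is a minimal sketch of what an ordered, pixel-aligned stack of camera-space points looks like for a toy scene. It is not the authors' method: WT-DiT predicts these layers generatively, whereas this example computes them analytically by intersecting each pixel's ray with a single sphere. The pinhole camera, the sphere, the two-layer limit, and the names `trace_sphere_layers` and `project` are all assumptions made for illustration.

```python
import numpy as np


def trace_sphere_layers(height, width, focal, center, radius):
    """Toy layered geometry: per-pixel front and back hits on one sphere.

    Assumes a pinhole camera at the origin looking down +z, with the
    principal point at the image center and the sphere fully in front of it.
    Returns points of shape (2, H, W, 3) and a hit mask of shape (H, W).
    """
    # Pixel centers on the image grid.
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    cx, cy = width / 2.0, height / 2.0
    # Ray directions scaled so z = 1, hence a point at depth z is z * dirs.
    dirs = np.stack([(u - cx) / focal, (v - cy) / focal, np.ones_like(u)], axis=-1)

    # Solve |t * d - c|^2 = r^2 for t; t is the camera-space depth.
    c = np.asarray(center, dtype=float)
    a = np.sum(dirs * dirs, axis=-1)
    b = -2.0 * (dirs @ c)
    k = c @ c - radius ** 2
    disc = b * b - 4.0 * a * k
    valid = disc > 0.0
    root = np.sqrt(np.where(valid, disc, 0.0))
    t_front = (-b - root) / (2.0 * a)  # layer 0: visible surface
    t_back = (-b + root) / (2.0 * a)   # layer 1: occluded back surface

    depths = np.stack([t_front, t_back], axis=0)   # (2, H, W), front to back
    points = depths[..., None] * dirs[None]        # (2, H, W, 3)
    points[:, ~valid] = np.nan                     # rays that miss the sphere
    return points, valid


def project(points, focal, height, width):
    """Pinhole projection of camera-space points back to pixel coordinates."""
    u = focal * points[..., 0] / points[..., 2] + width / 2.0
    v = focal * points[..., 1] / points[..., 2] + height / 2.0
    return u, v


if __name__ == "__main__":
    H, W, F = 64, 64, 64.0
    points, valid = trace_sphere_layers(H, W, F, center=(0.0, 0.0, 3.0), radius=1.0)
    u, v = project(points, F, H, W)
    print("stack shape:", points.shape)
    print("pixels with geometry:", int(valid.sum()))
    print("layer 1 behind layer 0:", bool(np.all(points[1, valid, 2] >= points[0, valid, 2])))
```

Both layers of a given pixel project back to that same pixel, which is the sense in which the stack is pixel-aligned, while the second layer carries geometry that a depth map alone would not contain.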