World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
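
To make the per-pixel representation concrete, the following is a minimal sketch of what an ordered stack of camera-space points looks like for a toy scene. It is not WT-DiT: the model predicts these layers generatively from a single image, whereas this sketch computes them analytically by intersecting pixel rays with known spheres. The pinhole camera model, the sphere scene, and all function and parameter names (`pixel_ray`, `sphere_hits`, `trace_pixel`, `focal`, `num_layers`) are assumptions made for illustration only.

```python
import math


def pixel_ray(u, v, width, height, focal):
    """Unit ray direction in camera space through the center of pixel (u, v)."""
    # Assumed pinhole model: principal point at the image center, +z forward.
    x = (u + 0.5 - width / 2.0) / focal
    y = (v + 0.5 - height / 2.0) / focal
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def sphere_hits(direction, center, radius):
    """Distances t > 0 along a ray from the origin where it crosses a sphere."""
    # With |d| = 1, |t*d - c|^2 = r^2 becomes t^2 - 2t(d.c) + |c|^2 - r^2 = 0.
    b = sum(d * c for d, c in zip(direction, center))
    disc = b * b - (sum(c * c for c in center) - radius * radius)
    if disc < 0:
        return []  # the ray misses this sphere
    root = math.sqrt(disc)
    return [t for t in (b - root, b + root) if t > 0]


def trace_pixel(u, v, width, height, focal, spheres, num_layers):
    """Ordered stack of camera-space 3D points for one pixel.

    Layer 0 is the visible surface; later layers are the following
    front-to-back surface crossings. Missing layers are padded with None.
    """
    d = pixel_ray(u, v, width, height, focal)
    ts = sorted(t for center, radius in spheres
                for t in sphere_hits(d, center, radius))
    points = [tuple(t * di for di in d) for t in ts[:num_layers]]
    return points + [None] * (num_layers - len(points))


if __name__ == "__main__":
    # Hypothetical scene: two unit spheres, one hidden behind the other.
    scene = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 9.0), 1.0)]
    for layer, point in enumerate(trace_pixel(2, 2, 5, 5, 5.0, scene, 4)):
        print(layer, point)
```

For the center pixel, the stack contains the front and back of the near sphere followed by the front and back of the hidden sphere, all lying on the same pixel ray. This is the sense in which every layer stays pixel-aligned: each point projects back to the pixel it was traced from, so the 2D-to-3D correspondence is preserved for occluded geometry as well as for the visible surface.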