World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Abstract

Image-to-3D methods often trade off faithfulness against completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
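
To make the layered representation concrete, the following is a minimal sketch of what an ordered, pixel-aligned stack of camera-space points could look like for a toy scene. It is not the WT-DiT model and makes no claim about the authors' data format: the function name `trace_sphere_layers`, the pinhole camera with a single focal length, the `(layers, H, W, 3)` array layout, and the use of NaN plus a validity mask for rays that miss are all assumptions chosen for illustration. A single sphere is used because each camera ray that hits it has exactly two intersections, one visible and one occluded.

```python
import numpy as np


def trace_sphere_layers(height, width, focal, center, radius):
    """Return (points, valid) for a sphere seen by a pinhole camera at the origin.

    points: (2, H, W, 3) camera-space hits; layer 0 is the visible (front)
            surface, layer 1 is the occluded (back) surface on the same ray.
    valid:  (2, H, W) boolean mask; False where the pixel's ray misses.
    All names and the array layout are illustrative assumptions.
    """
    # Pixel-centre coordinates relative to the image centre (principal point).
    u = np.arange(width) + 0.5 - width / 2.0
    v = np.arange(height) + 0.5 - height / 2.0
    uu, vv = np.meshgrid(u, v)  # each has shape (H, W)

    # Unit ray direction through every pixel, camera looking down +z.
    dirs = np.stack([uu / focal, vv / focal, np.ones_like(uu)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Solve |t * d - c|^2 = r^2 for the ray parameter t:
    #   t = b -/+ sqrt(b^2 - (|c|^2 - r^2)),  with b = d . c
    c = np.asarray(center, dtype=float)
    b = dirs @ c
    disc = b**2 - (c @ c - radius**2)
    hit = disc >= 0.0
    root = np.sqrt(np.where(hit, disc, 0.0))

    # Stack the two roots front-to-back so the layer index encodes ray order.
    t = np.stack([b - root, b + root], axis=0)  # (2, H, W)
    valid = hit[None] & (t > 0.0)
    points = np.where(valid[..., None], t[..., None] * dirs[None], np.nan)
    return points, valid


points, valid = trace_sphere_layers(
    65, 65, focal=65.0, center=(0.0, 0.0, 3.0), radius=1.0
)
# Front and back hits along the ray through the central pixel.
print(points[:, 32, 32])
```

Because every layer stores points along the ray of its own pixel, projecting any valid point back through the same camera returns the pixel it came from. In this toy setting, that is the 2D-to-3D correspondence the abstract refers to.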