World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth estimators and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
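
To make the representation concrete, the following is a minimal sketch of what an ordered, pixel-aligned stack of camera-space points can look like. It is an illustration under stated assumptions, not the paper's model or data pipeline: it assumes a pinhole camera at the origin and a single analytic sphere, and it obtains the layers by ray tracing rather than by prediction. The function names `pixel_rays` and `trace_sphere_layers`, the `(layers, H, W, 3)` array layout, and the validity mask are hypothetical choices made for this example.

```python
import numpy as np

def pixel_rays(height, width, focal):
    """Unit ray directions through pixel centers (assumed pinhole camera at the origin)."""
    v, u = np.meshgrid(np.arange(height) + 0.5, np.arange(width) + 0.5, indexing="ij")
    dirs = np.stack(
        [(u - width / 2) / focal, (v - height / 2) / focal, np.ones_like(u)], axis=-1
    )
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def trace_sphere_layers(dirs, center, radius):
    """Trace each pixel ray through a sphere (hypothetical stand-in for real geometry).

    Returns points of shape (2, H, W, 3) and a boolean mask of shape (2, H, W).
    Layer 0 holds the visible (front) hit, layer 1 the occluded (back) hit.
    """
    center = np.asarray(center, dtype=float)
    b = dirs @ center                                # ray-center projection, (H, W)
    disc = b**2 - (center @ center - radius**2)      # discriminant of the ray-sphere quadratic
    hit = disc > 0
    root = np.sqrt(np.where(hit, disc, 0.0))
    t = np.stack([b - root, b + root])               # distances in front-to-back order
    valid = hit[None] & (t > 0)
    points = np.where(valid[..., None], t[..., None] * dirs[None], np.nan)
    return points, valid

# Example: a unit sphere three units in front of a 32x32 camera.
dirs = pixel_rays(32, 32, focal=32.0)
points, valid = trace_sphere_layers(dirs, center=(0.0, 0.0, 3.0), radius=1.0)

# Every valid point in every layer reprojects to the pixel it is stored at.
for layer in range(points.shape[0]):
    pts = points[layer][valid[layer]]
    cols = np.nonzero(valid[layer])[1] + 0.5
    assert np.allclose(32.0 * pts[:, 0] / pts[:, 2] + 16.0, cols)
```

Because each layer is indexed by the same pixel grid, a point on the hidden back of the sphere still maps to a specific input pixel, which is the 2D-to-3D correspondence that the abstract describes. Rays that miss the geometry are marked invalid here; how the actual method handles rays with fewer intersections than layers is not stated in the abstract.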