Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

Abstract

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, and a cornerstone for understanding and interacting with the physical world. Traditional methods achieve high fidelity, but they rely on slow per-scene optimization or category-specific training, which limits their practical deployment and scalability. As a result, generalizable feed-forward 3D reconstruction has developed rapidly in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. This survey is motivated by a key observation: although the geometric output representations are diverse, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. We therefore abstract away from these representation differences and focus on model design, proposing a novel taxonomy of model design strategies that is agnostic to the output format. The taxonomy organizes research directions into five key problems driving recent development: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models. To ground the taxonomy empirically and support standardized evaluation, we also systematically review related benchmarks and datasets, and discuss and categorize real-world applications built on feed-forward 3D models. Finally, we outline future directions for open challenges such as scalability, evaluation standards, and world modeling.