三维视觉手册：数据、学习范式与应用

摘要

3D视觉技术正快速演进，其驱动力源自日益多元化的数据表征、学习范式与建模策略。然而，该领域仍因表征体系与基准测试的碎片化而难以构建关于效率、保真度及可扩展性的统一视角。本文提出一种以数据为中心的3D视觉分类体系，将几何表征、数据集、学习框架与应用场景整合在同一概念图谱中。我们首先剖析3D数据的主要结构表征——点云、网格、体素与3D高斯体——及其采集流程。继而探讨数据集设计、基准构建与监督范式如何推动近期进展，涵盖2D监督的3D学习、隐式神经表征与4D世界建模。通过这种整合性视角，我们厘清表征体系、学习范式与下游任务（如重建、生成与视频建模）之间的关联，为平衡效率与保真度、构建多模态几何基元的新兴趋势提供统合性认知。

English

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.