3D Vision Cookbook: Data, Learning Paradigms, and Applications

Abstract

3D vision has evolved rapidly, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop a unified perspective on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analyzing the principal structural representations of 3D data (point clouds, meshes, voxels, and 3D Gaussians) along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, and we offer a consolidated view of two emerging trends: balancing efficiency against fidelity, and multimodal geometric grounding.