A Cookbook of 3D Vision: Data, Learning Paradigms, and Applications

Abstract

3D vision has evolved rapidly, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop a unified perspective on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analyzing the principal structural representations of 3D data (point clouds, meshes, voxels, and 3D Gaussians) along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, and offer a consolidated view of emerging trends toward balancing efficiency with fidelity and toward multimodal geometric grounding.