3D Vision Cookbook: Data, Learning Paradigms, and Applications

Abstract

3D vision has evolved rapidly, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, which makes it difficult to develop a unified perspective on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analyzing the principal structural representations of 3D data (point clouds, meshes, voxels, and 3D Gaussians) along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes have shaped recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, and we offer a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.