A Review of 3D Object Detection with Vision-Language Models

Abstract

This review surveys 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. Drawing on more than 100 research papers, it provides the first systematic analysis dedicated to this topic. We begin by outlining the unique challenges of the task, emphasizing how it differs from 2D detection in spatial reasoning and data complexity. We then compare traditional approaches based on point clouds and voxel grids with modern vision-language frameworks such as CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review the key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective detection. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions for the field.

Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI
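As a rough illustration of what aligning textual and 3D features for open-vocabulary, zero-shot detection can look like, the sketch below labels one 3D object proposal by comparing its feature vector with text-prompt embeddings using cosine similarity. This is a minimal sketch under stated assumptions, not a method taken from any surveyed paper: the function names (`zero_shot_label`, `cosine_similarity`), the `threshold` parameter, and the three-dimensional toy embeddings are all hypothetical. In practice the text embeddings would come from a pretrained text encoder such as CLIP's, and the proposal feature from a 3D backbone trained to share that embedding space.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_label(object_feature, text_embeddings, threshold=0.0):
    """Return (label, score) for the best-matching text prompt.

    The label is None when the best score falls below `threshold`, which
    stands in for "no category in the vocabulary matches this proposal".
    """
    best_label, best_score = None, -1.0
    for label, embedding in text_embeddings.items():
        score = cosine_similarity(object_feature, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Hypothetical toy embeddings; real ones would come from a text encoder.
TEXT_EMBEDDINGS = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

# Hypothetical feature for a single 3D proposal (e.g. a pooled point-cloud feature).
proposal_feature = [0.9, 0.1, 0.0]

label, score = zero_shot_label(proposal_feature, TEXT_EMBEDDINGS)
print(label, round(score, 3))
```

Because the vocabulary is just a dictionary of prompt embeddings, new categories can be added at inference time without retraining the 3D side, which is the property that the term open-vocabulary refers to.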