SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model

Abstract

With the development of large language models, remarkable language systems such as ChatGPT have achieved striking success on many tasks, demonstrating the power of foundation models. To bring this capability to vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, was recently proposed and shows strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks, especially 3D object detection, has yet to be explored. Motivated by this gap, we explore adapting the zero-shot ability of SAM to 3D object detection. We propose a SAM-powered BEV (Bird's Eye View) processing pipeline to detect objects and obtain promising results on the large-scale Waymo Open Dataset. As an early attempt, our method takes a step toward 3D object detection with vision foundation models and presents an opportunity to unleash their power on 3D vision tasks. The code is released at https://github.com/DYZhang09/SAM3D.
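To make the idea of a BEV representation concrete, the following is a minimal sketch of how a LiDAR point cloud might be rasterized into a 2D top-down grid that a 2D segmenter could then consume. It is not the authors' implementation: the function name `points_to_bev`, the detection ranges, the cell size, and the choice of a per-cell point count as the pixel value are all assumptions made for illustration.

```python
import numpy as np


def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Rasterize an (N, 3) array of x, y, z points into a BEV occupancy grid.

    Hypothetical helper: the ranges (in meters) and the cell size are
    illustrative assumptions, not values taken from the paper.
    Returns a (rows, cols) integer array holding the point count per cell.
    """
    points = np.asarray(points, dtype=np.float64)
    n_cols = int(round((x_range[1] - x_range[0]) / cell))
    n_rows = int(round((y_range[1] - y_range[0]) / cell))

    x, y = points[:, 0], points[:, 1]
    # Discard points that fall outside the chosen BEV window.
    keep = (
        (x >= x_range[0]) & (x < x_range[1])
        & (y >= y_range[0]) & (y < y_range[1])
    )

    # Map metric coordinates to integer cell indices.
    cols = ((x[keep] - x_range[0]) / cell).astype(np.int64)
    rows = ((y[keep] - y_range[0]) / cell).astype(np.int64)
    # Guard against floating-point edge cases at the upper boundary.
    cols = np.minimum(cols, n_cols - 1)
    rows = np.minimum(rows, n_rows - 1)

    bev = np.zeros((n_rows, n_cols), dtype=np.int64)
    # np.add.at accumulates correctly when several points share a cell.
    np.add.at(bev, (rows, cols), 1)
    return bev


# Two nearby points land in the same cell; the third lies outside the window.
demo_points = [[1.0, 0.0, 0.5], [1.2, 0.1, 0.3], [100.0, 0.0, 0.0]]
demo_bev = points_to_bev(demo_points)
```

A grid like this can be converted to an image and segmented in 2D. How the authors encode BEV pixels and how masks are turned into 3D boxes is not stated in the abstract, so those steps are left out here.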
