3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Abstract

Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization because high-quality spatial data is scarce and viewpoint assumptions are static. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct Scene-30K, a high-quality synthetic chain-of-thought (CoT) dataset, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. Scene-30K serves as cold-start initialization data for 3D-R1. Moreover, during reinforcement learning training we use an RLHF (reinforcement learning from human feedback) policy such as GRPO to enhance reasoning capabilities, and we introduce three reward functions (a perception reward, a semantic similarity reward, and a format reward) to maintain detection accuracy and the semantic precision of answers. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
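To make the reward design concrete, the following is a minimal sketch of how three such rewards might be computed and turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the rewards, so the `<think>`/`<answer>` template, the use of 3D box IoU for the perception reward, the use of embedding cosine similarity for the semantic reward, and all function names are hypothetical. The one step taken from GRPO itself is the normalization of each reward against the mean and standard deviation of its sampled group.

```python
import math
import re
from statistics import mean, pstdev

# Assumed output template; the abstract does not state the required format.
FORMAT_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL
)


def format_reward(text: str) -> float:
    """Return 1.0 if the output matches the assumed template, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(text.strip()) else 0.0


def box_iou(a, b) -> float:
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Used here as a stand-in perception reward (an assumption).
    """
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # boxes do not intersect along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def cosine_similarity(u, v) -> float:
    """Cosine similarity of two embedding vectors.

    Used here as a stand-in semantic similarity reward (an assumption).
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against its sampled group, as GRPO does."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, the total reward for each sampled response would be the sum of the three terms, and `group_relative_advantages` would be applied to the totals of all responses sampled for the same prompt. How the actual model weights or combines the rewards is not stated in the abstract.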
