3D-R1: Enhancing the Reasoning Capabilities of 3D Vision-Language Models for Unified Scene Understanding

Abstract

Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization because high-quality spatial data is limited and viewpoint assumptions are static. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct Scene-30K, a high-quality synthetic dataset with chain-of-thought (CoT) annotations, by leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. Scene-30K serves as cold-start initialization data for 3D-R1. Moreover, we apply RLHF policies such as GRPO during reinforcement learning training to enhance reasoning capabilities, and we introduce three reward functions (a perception reward, a semantic similarity reward, and a format reward) to maintain detection accuracy and the semantic precision of answers. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
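To make the reward design concrete, the following is a minimal sketch of how three rewards of this kind could be combined and turned into the group-relative advantages that GRPO uses. It is not the released 3D-R1 implementation. The `<think>`/`<answer>` tag layout, the axis-aligned box IoU as the perception reward, the token-overlap score standing in for semantic similarity, the unweighted sum, and all function names are assumptions made for illustration only.

```python
import re
from statistics import mean, pstdev


def format_reward(response: str) -> float:
    """Return 1.0 if the response matches an assumed <think>/<answer> layout."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def perception_reward(pred_box, gt_box) -> float:
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(pred_box[i + 3], gt_box[i + 3]) - max(pred_box[i], gt_box[i])
        if overlap <= 0:
            return 0.0  # boxes do not intersect along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(pred_box) + volume(gt_box) - inter)


def semantic_reward(pred: str, ref: str) -> float:
    """Token-level Jaccard overlap; a crude stand-in for an embedding similarity."""
    a, b = set(pred.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def total_reward(response, answer, ref_answer, pred_box, gt_box) -> float:
    """Unweighted sum of the three rewards (the weighting is an assumption)."""
    return (format_reward(response)
            + perception_reward(pred_box, gt_box)
            + semantic_reward(answer, ref_answer))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, several responses are sampled for one prompt, each is scored with `total_reward`, and `group_relative_advantages` rescales the scores so that responses better than the group average receive a positive advantage and the rest a negative one.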