3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
July 31, 2025
Authors: Ting Huang, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Large vision-language models (VLMs) have made significant strides in 2D
visual understanding tasks, sparking interest in extending these capabilities
to 3D scene understanding. However, current 3D VLMs often struggle with robust
reasoning and generalization due to limitations in high-quality spatial data
and the static nature of viewpoint assumptions. To address these challenges, we
propose 3D-R1, a foundation model that enhances the reasoning capabilities of
3D VLMs. Specifically, we first construct a high-quality synthetic dataset with
chain-of-thought (CoT) reasoning, named Scene-30K, leveraging existing 3D-VL
datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start
initialization data for 3D-R1.
Moreover, we leverage RLHF policies such as GRPO in the reinforcement learning
training process to enhance reasoning capabilities, and we introduce three
reward functions: a perception reward, a semantic similarity reward, and a
format reward, which maintain detection accuracy and answer semantic precision.
Furthermore, we introduce a dynamic view selection strategy that adaptively
chooses the most informative perspectives for 3D scene understanding. Extensive
experiments demonstrate that 3D-R1 delivers an average improvement of 10%
across various 3D scene benchmarks, highlighting its effectiveness in enhancing
reasoning and generalization in 3D scene understanding. Code:
https://github.com/AIGeeksGroup/3D-R1. Website:
https://aigeeksgroup.github.io/3D-R1.
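
The abstract names three reward signals and GRPO but gives no formulas, so the following is only a minimal sketch of how such a setup could be wired together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the <think>/<answer> output template, the use of axis-aligned 3D box IoU as the perception reward, cosine similarity between answer embeddings as the semantic similarity reward, the unweighted sum of the three terms, and all function names. The group-relative advantage step follows the standard GRPO recipe of normalizing each sampled response's reward by the mean and standard deviation of its group.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical output template; the abstract does not specify the format.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(text):
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0


def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Used here as a stand-in perception reward (an assumption).
    """
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors (stand-in semantic reward)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def total_reward(text, pred_box, gt_box, pred_emb, gt_emb):
    """Unweighted sum of the three rewards; the weighting is an assumption."""
    return (
        format_reward(text)
        + box_iou_3d(pred_box, gt_box)
        + cosine_similarity(pred_emb, gt_emb)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several responses for the same scene and question, score each with total_reward, and pass the list of scores to group_relative_advantages; responses that score above the group mean receive positive advantages and are reinforced, without a separate value network.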