Error-Driven Scene Editing for 3D Grounding in Large Language Models

November 18, 2025
Authors: Yue Zhang, Zun Wang, Han Lin, Jialu Li, Jianing Yang, Yonatan Bitton, Idan Szpektor, Mohit Bansal
cs.AI

Abstract

Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D-LLMs.
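
To make the "diagnose, then edit" idea concrete, the following is a toy sketch and not the authors' implementation. It assumes a drastically simplified scene (objects with a color and a single x coordinate), and every name in it (`Obj`, `Predicate`, `check`, `diagnose`, `ground`, `make_counterfactual`) is hypothetical. The query "the red chair left of the table" is treated as already decomposed into two predicates; the wrongly predicted object is checked against each predicate, and the failed predicate alone is flipped on the target and the distractor to yield a counterfactual scene whose correct answer changes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    category: str
    color: str
    x: float  # toy 1D position; a real scene would use full 3D poses


@dataclass(frozen=True)
class Predicate:
    kind: str   # "color" or "left_of" (hypothetical predicate types)
    value: str  # a color name, or the name of an anchor object


def check(pred, obj, scene):
    """Return True if obj satisfies pred in the given scene."""
    if pred.kind == "color":
        return obj.color == pred.value
    if pred.kind == "left_of":
        return obj.x < scene[pred.value].x
    raise ValueError(f"unknown predicate kind: {pred.kind}")


def diagnose(preds, predicted, scene):
    """Predicate-level diagnosis: which predicates does the prediction violate?"""
    return [p for p in preds if not check(p, predicted, scene)]


def ground(preds, scene, category):
    """Names of objects of the given category that satisfy every predicate."""
    return [name for name, o in scene.items()
            if o.category == category and all(check(p, o, scene) for p in preds)]


def flip(scene, name, pred, alt_color="gray"):
    """Minimal edit: change only the property tied to pred (assumes alt_color != pred.value)."""
    obj = scene[name]
    if pred.kind == "color":
        new = replace(obj, color=alt_color if obj.color == pred.value else pred.value)
    elif pred.kind == "left_of":
        anchor = scene[pred.value]
        new = replace(obj, x=2 * anchor.x - obj.x)  # mirror across the anchor
    else:
        raise ValueError(f"unknown predicate kind: {pred.kind}")
    return {**scene, name: new}


def make_counterfactual(scene, target, distractor, pred):
    """Flip the failed predicate on both objects so the correct answer swaps."""
    return flip(flip(scene, target, pred), distractor, pred)


# Query: "the red chair left of the table", assumed already decomposed.
scene = {
    "chair_a": Obj("chair", "red", 1.0),
    "chair_b": Obj("chair", "red", 3.0),
    "table": Obj("table", "brown", 2.0),
}
preds = [Predicate("color", "red"), Predicate("left_of", "table")]

# Suppose the model wrongly picked chair_b; only the spatial predicate fails.
failed = diagnose(preds, scene["chair_b"], scene)
counterfactual = make_counterfactual(scene, "chair_a", "chair_b", failed[0])
new_label = ground(preds, counterfactual, "chair")
```

In this toy setup the edited scene and its new label would form one counterfactual training example. The retraining step, the real 3D edits, and the way predicates are extracted from language are all outside the sketch.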