

Error-Driven Scene Editing for 3D Grounding in Large Language Models

November 18, 2025
Authors: Yue Zhang, Zun Wang, Han Lin, Jialu Li, Jianing Yang, Yonatan Bitton, Idan Szpektor, Mohit Bansal
cs.AI

Abstract

Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement. DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D-LLMs.
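To make the "Decompose, Diagnostic Evaluation, Edit, and Re-train" loop concrete, the sketch below shows one way such an error-driven pipeline could be wired together. It is a minimal illustration based only on the abstract; every helper (evaluate_grounding, decompose_into_predicates, diagnose_failed_predicate, recolor_object, reposition_object, finetune) is a hypothetical placeholder standing in for the paper's components, not an actual API.

```python
# Hypothetical sketch of a DEER-3D-style error-driven loop (not the authors' code).
# Helper functions are assumed placeholders for the framework's components.

from dataclasses import dataclass

@dataclass
class Predicate:
    kind: str      # e.g. "attribute" or "spatial_relation"
    target: str    # object (or object pair) the predicate constrains
    value: str     # e.g. a color name, or a relation such as "left_of"

def deer3d_iteration(model, scenes, grounding_queries, num_rounds=3):
    """Decompose -> Diagnostic Evaluation -> Edit -> Re-train, repeated iteratively."""
    for _ in range(num_rounds):
        counterfactuals = []
        for scene, query in zip(scenes, grounding_queries):
            # 1. Diagnostic evaluation: detect grounding failures of the 3D-LLM.
            if evaluate_grounding(model, scene, query):
                continue  # correctly grounded; no edit needed

            # 2. Decompose the query into predicates and locate the failing one.
            predicates = decompose_into_predicates(query)
            failed: Predicate = diagnose_failed_predicate(model, scene, predicates)

            # 3. Minimal, predicate-aligned scene edit (e.g. recolor or reposition).
            if failed.kind == "attribute":
                edited_scene = recolor_object(scene, failed.target, new_color=failed.value)
            else:
                edited_scene = reposition_object(scene, failed.target, relation=failed.value)

            # The edited scene paired with the query becomes counterfactual supervision.
            counterfactuals.append((edited_scene, query))

        # 4. Re-train: fine-tune on the targeted counterfactuals, then repeat.
        model = finetune(model, counterfactuals)
    return model
```

The design choice mirrored here is that each edit perturbs only the predicate the model got wrong, so every counterfactual isolates a single failure mode rather than broadly augmenting the data.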