3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
April 9, 2026
Authors: Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou
cs.AI
Abstract
Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
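The two ingredients the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function and field names (`distort_scene_graph`, `contrastive_logits`, `category`, `center`, `extent`) are hypothetical, and the logit combination follows the standard visual-contrastive-decoding form `(1 + α)·logit_orig − α·logit_distorted`; the paper's exact perturbation schedule and decoding rule may differ.

```python
import random

def distort_scene_graph(objects, category_pool, rng, noise=0.5):
    """Apply semantic and geometric perturbations to an object-centric
    3D scene graph: category substitution, coordinate jitter, and
    extent (bounding-box size) corruption. Field names are illustrative."""
    distorted = []
    for obj in objects:
        distorted.append({
            # semantic perturbation: replace the category label
            "category": rng.choice(category_pool),
            # geometric perturbation: jitter the 3D center coordinates
            "center": [c + rng.uniform(-noise, noise) for c in obj["center"]],
            # geometric perturbation: rescale the box extents
            "extent": [max(1e-3, e * rng.uniform(0.5, 1.5)) for e in obj["extent"]],
        })
    return distorted

def contrastive_logits(logits_orig, logits_dist, alpha=1.0):
    """Combine per-token logits from the original and distorted contexts.
    Tokens whose score barely changes under distortion (i.e. driven by
    language priors rather than scene evidence) are suppressed relative
    to tokens that do depend on the grounded 3D context."""
    return [(1 + alpha) * lo - alpha * ld
            for lo, ld in zip(logits_orig, logits_dist)]
```

For example, a token scoring 3.0 under both the original and distorted scene graphs (evidence-insensitive) ends up at `2*3.0 - 3.0 = 3.0`, while a token scoring 1.0 originally but 0.0 under distortion (evidence-sensitive) rises to `2*1.0 - 0.0 = 2.0`, narrowing the gap in favor of grounded tokens.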