ChatPaper.ai

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

April 9, 2026
作者: Makanjuola Ogunleye, Eman Abdelrahman, Ismini Lourentzou
cs.AI

Abstract

Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
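The abstract does not spell out the decoding rule, so the following is a minimal sketch of the approach it describes, assuming the standard visual-contrastive-decoding logit adjustment from the 2D literature; the scene-graph fields, perturbation choices, and the `alpha` parameter are all illustrative, not the paper's actual implementation:

```python
import copy
import random

def perturb_scene_graph(scene, rng, category_pool=("chair", "table", "lamp")):
    """Distort an object-centric 3D scene graph via semantic perturbations
    (category substitution) or geometric ones (center/extent corruption)."""
    distorted = copy.deepcopy(scene)
    for obj in distorted:
        if rng.random() < 0.5:
            # Semantic perturbation: swap in a different object category.
            obj["category"] = rng.choice(
                [c for c in category_pool if c != obj["category"]])
        else:
            # Geometric perturbation: jitter coordinates and rescale extents.
            obj["center"] = [c + rng.gauss(0.0, 0.5) for c in obj["center"]]
            obj["extent"] = [e * rng.uniform(0.5, 1.5) for e in obj["extent"]]
    return distorted

def contrastive_logits(logits_orig, logits_distorted, alpha=1.0):
    """VCD-style adjustment: tokens whose logits barely change under the
    distorted scene context (likely driven by language priors) are suppressed
    relative to tokens that depend on the true scene evidence."""
    return [(1.0 + alpha) * lo - alpha * ld
            for lo, ld in zip(logits_orig, logits_distorted)]

# Toy example: two candidate tokens score equally under the original scene.
# Token 0 is insensitive to the distortion (prior-driven); token 1's logit
# drops under distortion (scene-grounded), so contrastive decoding prefers it.
orig = [3.0, 3.0]
dist = [3.0, 1.0]
adjusted = contrastive_logits(orig, dist, alpha=1.0)
print(adjusted.index(max(adjusted)))  # → 1
```

The contrast acts as a grounding check: a token the model would emit regardless of what the scene contains gains nothing from the original context over the corrupted one, and its adjusted logit falls accordingly.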