3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Abstract

Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
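The two mechanisms named in the abstract, scene-graph distortion and contrasting predictions across the original and distorted contexts, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the exact perturbation scheme or scoring rule, so the sketch borrows the combination rule commonly used in 2D visual contrastive decoding, (1 + alpha) * log p_original - alpha * log p_distorted, with a plausibility cutoff beta. The names `SceneObject`, `distort_scene`, `contrastive_logits`, and the parameters `sigma`, `swap_prob`, `alpha`, and `beta` are hypothetical and are not taken from the paper.

```python
import math
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    """One node of a hypothetical object-centric 3D scene graph."""
    category: str
    center: tuple  # (x, y, z) position, assumed to be in metres
    extent: tuple  # (width, depth, height), assumed to be in metres


def distort_scene(objects, categories, rng, sigma=0.5, swap_prob=0.5):
    """Return a corrupted copy of the scene (assumed perturbation scheme).

    `categories` must contain at least two distinct labels.
    """
    distorted = []
    for obj in objects:
        category = obj.category
        if rng.random() < swap_prob:
            # Semantic perturbation: substitute a different category label.
            category = rng.choice([c for c in categories if c != obj.category])
        # Geometric perturbation: Gaussian noise on coordinates and extents.
        center = tuple(v + rng.gauss(0.0, sigma) for v in obj.center)
        extent = tuple(max(0.01, v + rng.gauss(0.0, sigma)) for v in obj.extent)
        distorted.append(replace(obj, category=category, center=center, extent=extent))
    return distorted


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def contrastive_logits(logits_orig, logits_dist, alpha=1.0, beta=0.1):
    """Score next tokens by contrasting original and distorted contexts."""
    logp_orig = log_softmax(logits_orig)
    logp_dist = log_softmax(logits_dist)
    # Plausibility cutoff: drop tokens that are unlikely under the original scene.
    cutoff = max(logp_orig) + math.log(beta)
    scores = []
    for lo, ld in zip(logp_orig, logp_dist):
        if lo < cutoff:
            scores.append(float("-inf"))
        else:
            # Tokens that do not depend on the real scene gain nothing here.
            scores.append((1 + alpha) * lo - alpha * ld)
    return scores


# Usage with made-up numbers: in practice the two logit vectors would come
# from running the same 3D-LLM on the original and the distorted scene graph.
rng = random.Random(0)
scene = [SceneObject("chair", (1.0, 2.0, 0.0), (0.5, 0.5, 1.0))]
corrupted = distort_scene(scene, ["chair", "table", "sofa"], rng, swap_prob=1.0)

vocab = ["yes", "no"]
logits_orig = [1.0, 0.8]  # hypothetical logits given the original scene
logits_dist = [1.0, 0.2]  # hypothetical logits given the distorted scene
scores = contrastive_logits(logits_orig, logits_dist)
best = vocab[scores.index(max(scores))]
```

In this toy case, greedy decoding on the original logits alone would pick "yes", whereas the contrastive score favors "no", because "no" is the answer whose probability rises when the real scene replaces the distorted one. A real system would apply this rescoring at every decoding step; because it only changes how logits are combined at inference time, no retraining is involved, which matches the abstract's claim.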
