3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Abstract

Large multimodal models are increasingly used as the reasoning core of embodied agents operating in 3D environments, yet they remain prone to hallucinations that can produce unsafe and ungrounded decisions. Existing inference-time hallucination mitigation methods largely target 2D vision-language settings and do not transfer to embodied 3D reasoning, where failures arise from object presence, spatial layout, and geometric grounding rather than pixel-level inconsistencies. We introduce 3D-VCD, the first inference-time visual contrastive decoding framework for hallucination mitigation in 3D embodied agents. 3D-VCD constructs a distorted 3D scene graph by applying semantic and geometric perturbations to object-centric representations, such as category substitutions and coordinate or extent corruption. By contrasting predictions under the original and distorted 3D contexts, our method suppresses tokens that are insensitive to grounded scene evidence and are therefore likely driven by language priors. We evaluate 3D-VCD on the 3D-POPE and HEAL benchmarks and show that it consistently improves grounded reasoning without any retraining, establishing inference-time contrastive decoding over structured 3D representations as an effective and practical route to more reliable embodied intelligence.
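The abstract names two steps, distorting the scene graph and contrasting predictions, without giving formulas. The following is a minimal sketch of how those steps could fit together, not the authors' implementation. It assumes the contrast rule commonly used in 2D visual contrastive decoding (amplify the original log-probabilities, subtract the distorted ones, and keep only tokens that are plausible under the original context). The `SceneObject` fields, the function names, and the `alpha`, `beta`, `swap_prob`, and `noise` parameters are all hypothetical.

```python
import math
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    """One node of a hypothetical object-centric 3D scene graph."""
    category: str
    center: tuple   # (x, y, z) position of the object
    extent: tuple   # bounding-box size along each axis


def distort_scene(objects, categories, rng, swap_prob=0.5, noise=1.0):
    """Return a perturbed copy of the scene; the input list is left untouched."""
    distorted = []
    for obj in objects:
        category = obj.category
        if rng.random() < swap_prob:
            # Semantic perturbation: substitute a different category label.
            category = rng.choice([c for c in categories if c != obj.category])
        # Geometric perturbation: jitter the center and rescale the extent.
        center = tuple(c + rng.gauss(0.0, noise) for c in obj.center)
        extent = tuple(e * rng.uniform(0.5, 1.5) for e in obj.extent)
        distorted.append(replace(obj, category=category, center=center, extent=extent))
    return distorted


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(logits_original, logits_distorted, alpha=1.0, beta=0.1):
    """Score next tokens by contrasting original and distorted contexts.

    Assumed rule (borrowed from 2D contrastive decoding, not from the abstract):
        score = (1 + alpha) * log p_original - alpha * log p_distorted
    Tokens with p_original < beta * max(p_original) are masked to -inf.
    """
    lp_orig = log_softmax(logits_original)
    lp_dist = log_softmax(logits_distorted)
    cutoff = max(lp_orig) + math.log(beta)
    return [
        (1 + alpha) * lo - alpha * ld if lo >= cutoff else float("-inf")
        for lo, ld in zip(lp_orig, lp_dist)
    ]
```

As a toy illustration, take a three-token vocabulary with original logits `[2.0, 1.8, -3.0]` and distorted logits `[2.0, 0.5, -3.0]`. Greedy decoding on the original logits picks the first token, whose logit does not move when the scene is corrupted. After contrasting, the second token, whose logit depends on the scene evidence, scores highest, and the implausible third token is masked out. In a real system the two logit vectors would come from running the same model on prompts built from the original scene and from the output of `distort_scene`.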
