Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Abstract

Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
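The entropy-aware mode switching described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `entropy_threshold`, and the greedy (argmax) choice in the discrete branch are all assumptions made for illustration, since the abstract does not specify them. The prior-guided visual anchor injection is not covered here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def token_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a next-token distribution."""
    return float(-np.sum(probs * np.log(probs + eps)))


def next_input_embedding(logits, embedding_matrix, entropy_threshold):
    """Pick the embedding fed to the model at the next decoding step.

    Hypothetical helper: in a high-entropy state it returns the
    probability-weighted mixture of all token embeddings (a "superposed"
    continuous input); otherwise it returns the embedding of the single
    most likely token (greedy choice, an assumption of this sketch).
    """
    probs = softmax(logits)
    entropy = token_entropy(probs)
    if entropy > entropy_threshold:
        # Latent mode: expected embedding under the token distribution.
        return probs @ embedding_matrix, "latent", entropy
    # Discrete mode: ordinary embedding lookup for one token.
    token_id = int(np.argmax(probs))
    return embedding_matrix[token_id], "discrete", entropy


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens with 3-dimensional embeddings.
    toy_embeddings = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
    ])
    for logits in (np.zeros(4), np.array([10.0, 0.0, 0.0, 0.0])):
        emb, mode, h = next_input_embedding(logits, toy_embeddings, 1.0)
        print(mode, round(h, 4), emb)
```

In this toy setup, a flat distribution falls into the latent branch and a sharply peaked one falls into the discrete branch. A real decoder would pass the returned vector to the model as its next input embedding rather than as a token id, and the threshold would have to be chosen empirically.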

