Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
March 9, 2026
Authors: Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge
cs.AI
Abstract
Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., "because," "however," and "wait") are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. Our hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. We therefore propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. To this end, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method is entropy-aware reasoning mode switching: the model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
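The abstract does not include code, but the entropy-aware mode switch it describes can be illustrated with a minimal NumPy sketch. The function name, the entropy threshold value, and the interface below are all assumptions for illustration, not the paper's actual implementation: we compute the entropy of the next-token distribution and return either a probability-weighted mixture of token embeddings (the continuous, "superposed" mode) or the discrete embedding of the argmax token.

```python
import numpy as np

def next_input_embedding(logits, embedding_table, entropy_threshold=2.0):
    """Hypothetical sketch of entropy-aware reasoning mode switching.

    High entropy -> probability-weighted mixture of candidate token
    embeddings (latent superposed reasoning); low entropy -> the usual
    discrete embedding of the most likely token. The 2.0-nat threshold
    is an illustrative placeholder, not a value from the paper.
    """
    # Numerically stable softmax over the vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Shannon entropy of the next-token distribution (in nats).
    entropy = -np.sum(probs * np.log(probs + 1e-12))

    if entropy > entropy_threshold:
        # Continuous mode: expected embedding under the token
        # distribution, preserving all candidate semantics.
        return probs @ embedding_table, entropy

    # Discrete mode: embedding of the single most likely token.
    return embedding_table[np.argmax(probs)], entropy
```

In a decoding loop, this embedding would be fed back as the next input instead of a sampled token's embedding whenever entropy is high, then the model would fall back to ordinary discrete decoding once the distribution sharpens.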