Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Abstract

Recent advances in multimodal large reasoning models (MLRMs) have significantly improved performance on visual question answering. However, we observe that transition words (e.g., "because", "however", and "wait") are closely associated with hallucinations and tend to occur in high-entropy states. We argue that adequate contextual reasoning information can be extracted directly from the token probability distribution. Inspired by superposed representation theory, we propose latent superposed reasoning, which integrates multiple candidate semantics and maintains latent reasoning trajectories. We hypothesize that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, leaving dense contextual cues underutilized during high-entropy reasoning stages. We therefore construct rich semantic representations from the token probability distribution to enhance in-context reasoning. To this end, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. At the heart of the method is entropy-aware reasoning mode switching: the model uses probability-weighted continuous embeddings in high-entropy states and switches back to discrete token embeddings as entropy decreases. We further propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
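As a rough illustration of the entropy-aware mode switching described in the abstract, the following minimal Python sketch computes the entropy of the next-token distribution and chooses between a probability-weighted (continuous) input embedding and the embedding of the single most likely token. The function names, the fixed entropy threshold, and the toy embedding table are assumptions made for illustration only. The abstract does not say how the switching criterion is set, and the visual anchor injection step is not modeled here, so this should not be read as the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(logits, embedding_table, entropy_threshold):
    """Pick the embedding fed back to the model at the next step.

    Hypothetical helper: `entropy_threshold` is an assumed fixed cutoff.
    Returns (mode, entropy, embedding).
    """
    probs = softmax(logits)
    h = entropy(probs)
    dim = len(embedding_table[0])
    if h > entropy_threshold:
        # High entropy: keep all candidates "in superposition" by feeding
        # the expected embedding under the token distribution.
        emb = [
            sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)
        ]
        return "latent", h, emb
    # Low entropy: commit to the most likely token's discrete embedding.
    best = max(range(len(probs)), key=probs.__getitem__)
    return "discrete", h, list(embedding_table[best])


if __name__ == "__main__":
    # Toy vocabulary of 3 tokens with 2-dimensional embeddings (assumed).
    table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(next_input_embedding([0.0, 0.0, 0.0], table, entropy_threshold=0.5))
    print(next_input_embedding([10.0, 0.0, 0.0], table, entropy_threshold=0.5))
```

In this sketch a uniform distribution over the toy vocabulary triggers the latent (probability-weighted) mode, while a sharply peaked distribution falls back to the ordinary discrete token embedding.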
