More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
