More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

May 23, 2025
作者: Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu
cs.AI

Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
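The abstract does not give RH-AUC's exact formulation, but its description (quantifying how a model's perception accuracy changes with reasoning length) suggests an area-under-curve computation. Below is a minimal sketch under that assumption: perception accuracy is evaluated at several reasoning-length buckets, lengths are normalized to [0, 1], and the resulting curve is integrated with the trapezoidal rule. The function name `rh_auc` and these choices are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_accuracies):
    """Sketch of an RH-AUC-style score (hypothetical, not the paper's code).

    Integrates perception accuracy over reasoning length, with lengths
    normalized to [0, 1] so models with different maximum chain lengths
    are comparable. Higher scores suggest visual grounding is better
    preserved as reasoning grows longer.
    """
    lengths = np.asarray(reasoning_lengths, dtype=float)
    accs = np.asarray(perception_accuracies, dtype=float)
    order = np.argsort(lengths)              # sort points by reasoning length
    lengths, accs = lengths[order], accs[order]
    span = lengths[-1] - lengths[0]
    x = (lengths - lengths[0]) / span if span > 0 else np.zeros_like(lengths)
    return float(np.trapz(accs, x))          # trapezoidal area under the curve

# Example: a model whose perception accuracy decays with longer chains.
print(rh_auc([128, 256, 512, 1024], [0.82, 0.78, 0.69, 0.55]))  # ≈ 0.68
```

Under this reading, a model that holds its perception accuracy flat across reasoning lengths would score near that accuracy, while one that hallucinates more as chains lengthen would score lower, capturing the reasoning-hallucination trade-off the benchmark targets.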
