
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

May 23, 2025
Authors: Chengzhi Liu, Zhongxing Xu, Qingyue Wei, Juncheng Wu, James Zou, Xin Eric Wang, Yuyin Zhou, Sheng Liu
cs.AI

Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
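The abstract does not spell out how RH-AUC is computed. Below is a minimal sketch of one plausible reading, assuming the metric is the normalized area under a perception-accuracy-versus-reasoning-length curve; the function name `rh_auc`, the quantile binning, and the trapezoidal integration are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_correct, num_bins=10):
    """Illustrative RH-AUC-style score (assumed form, not the paper's exact
    definition): normalized area under the perception-accuracy-vs-
    reasoning-length curve.

    reasoning_lengths: reasoning-chain length per sample (e.g. token count).
    perception_correct: 1 if the answer stayed visually grounded
        (no hallucination) on that sample, else 0.
    Returns a value in [0, 1]; higher means perception accuracy is better
    preserved as reasoning chains grow longer.
    """
    lengths = np.asarray(reasoning_lengths, dtype=float)
    correct = np.asarray(perception_correct, dtype=float)

    # Bin samples by reasoning length (quantile bins keep counts balanced)
    # and compute perception accuracy within each bin.
    edges = np.quantile(lengths, np.linspace(0.0, 1.0, num_bins + 1))
    bin_mid, bin_acc = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (lengths >= lo) & (lengths <= hi)
        if mask.any():
            bin_mid.append((lo + hi) / 2.0)
            bin_acc.append(correct[mask].mean())

    # Normalized area under the accuracy-vs-length curve (trapezoidal rule).
    x, y = np.asarray(bin_mid), np.asarray(bin_acc)
    span = x[-1] - x[0]
    return float(np.trapz(y, x) / span) if span > 0 else float(y.mean())
```

Under this reading, a model whose visual grounding degrades sharply as its reasoning chains lengthen traces a falling accuracy curve and receives a lower score, while a model that "thinks" longer without losing sight of the image scores near its base perception accuracy.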