

PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

October 18, 2025
Authors: Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
cs.AI

Abstract

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations: issues that are often subtle and domain-specific, and that ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
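
The abstract does not spell out the exact answer schema, so the sketch below is only an illustration of the idea, with hypothetical field names: each candidate answer is rendered as a compact JSON object rather than a fluent sentence, so options differ in content but not in surface style.

```python
import json

# Hypothetical sketch: each multiple-choice option is expressed as a structured
# JSON object instead of a natural-language sentence, so options cannot be told
# apart by length, fluency, or phrasing alone. All field names and values below
# are illustrative assumptions, not the benchmark's actual schema.
options = [
    {"element_a": "Table 2", "element_b": "Section 4.1", "issue": "reported value mismatch"},
    {"element_a": "Figure 3", "element_b": "Equation 5", "issue": "symbol definition mismatch"},
    {"element_a": "Table 1", "element_b": "Abstract", "issue": "dataset size mismatch"},
    {"element_a": "Figure 1", "element_b": "Section 3.2", "issue": "axis label mismatch"},
]

# The model would be asked to return the index (or the object itself) of the
# option matching the reviewer-flagged inconsistency.
prompt_block = "\n".join(
    f"({i}) {json.dumps(opt, sort_keys=True)}" for i, opt in enumerate(options)
)
print(prompt_block)
```

Because every option shares the same keys and serialization, superficial cues such as option length or phrasing no longer distinguish the correct answer, which is the kind of choice-only shortcut the JSON representation is meant to suppress.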