PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

October 18, 2025
Authors: Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
cs.AI

Abstract

Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations. Such inconsistencies are often subtle and domain-specific, and they ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks (inconsistency identification, remedy, and pair matching) that assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. To address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we also introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (accuracy between 26.1% and 54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
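The abstract does not spell out the JSON answer format, so the following is only a minimal sketch of the general idea: every multiple-choice option is encoded with the same fixed keys, so that options differ in content rather than in phrasing or length. The field names (`element_a`, `claim_a`, and so on), the helper functions, and the example values are all hypothetical and are not taken from PRISMM-Bench.

```python
import json
import random

# Hypothetical schema: these field names are illustrative assumptions,
# not the format used by PRISMM-Bench.
def to_structured_option(element_a, element_b, attribute, claim_a, claim_b):
    """Encode one answer option as a JSON string with a fixed set of keys."""
    option = {
        "element_a": element_a,   # e.g. a table or figure
        "element_b": element_b,   # e.g. a text passage
        "attribute": attribute,   # what the two elements disagree on
        "claim_a": claim_a,
        "claim_b": claim_b,
    }
    return json.dumps(option, sort_keys=True)


def build_question(correct, distractors, seed=0):
    """Shuffle options deterministically; return (labelled options, gold letter).

    Supports at most eight options in total.
    """
    options = [correct] + list(distractors)
    rng = random.Random(seed)
    rng.shuffle(options)
    letters = "ABCDEFGH"
    labelled = {letters[i]: opt for i, opt in enumerate(options)}
    gold = next(letter for letter, opt in labelled.items() if opt == correct)
    return labelled, gold


def accuracy(predictions, answers):
    """Fraction of predicted letters that match the gold letters."""
    if not answers:
        return 0.0
    hits = sum(p == a for p, a in zip(predictions, answers))
    return hits / len(answers)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    correct = to_structured_option("Table 2", "Section 4.1", "accuracy", "81.3", "83.1")
    distractor = to_structured_option("Figure 3", "Section 4.1", "axis label", "epochs", "steps")
    labelled, gold = build_question(correct, [distractor], seed=1)
    for letter, option in labelled.items():
        print(letter, option)
    print("gold:", gold)
```

Because every option shares the same keys and serialization, a model cannot pick the answer from stylistic cues such as sentence length or hedging words; it has to compare the encoded claims themselves.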