PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies
October 18, 2025
Authors: Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin
cs.AI
Abstract
Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations: issues that are often subtle and domain-specific, and that ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this problem, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
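
To make the idea of structured JSON-based answer representations concrete, the sketch below shows one way to serialize multiple-choice options and parse a model's reply. The abstract does not specify the benchmark's actual schema, so the field names (answer_id, claim_location, issue), the prompt wording, and the example options are illustrative assumptions; the point is only that giving every option the same neutral JSON structure removes the stylistic cues (length, phrasing, tone) that choice-only shortcuts exploit.

```python
import json

# Hypothetical benchmark item: the real PRISMM-Bench schema is not given in the
# abstract, so all field names and contents below are illustrative assumptions.
options = {
    "A": {"claim_location": "Table 2", "issue": "reported mean differs from Figure 3"},
    "B": {"claim_location": "Eq. (4)", "issue": "symbol is defined but never used"},
    "C": {"claim_location": "Section 5", "issue": "dataset size contradicts the abstract"},
}


def build_prompt(question: str, options: dict) -> str:
    """Serialize answer options as JSON so all choices share one neutral
    structure, reducing superficial stylistic cues a model could exploit."""
    return (
        f"{question}\n\n"
        "Answer options (JSON):\n"
        f"{json.dumps(options, indent=2)}\n\n"
        'Reply with a JSON object of the form {"answer_id": "<option key>"}.'
    )


def parse_answer(model_output: str, options: dict) -> str | None:
    """Parse and validate a structured reply; return the chosen option key,
    or None if the reply is malformed or names a non-existent option."""
    try:
        reply = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict):
        return None
    choice = reply.get("answer_id")
    return choice if choice in options else None


if __name__ == "__main__":
    prompt = build_prompt(
        "Which reviewer-flagged inconsistency matches the highlighted figure?",
        options,
    )
    print(prompt)
    print(parse_answer('{"answer_id": "B"}', options))  # -> B
    print(parse_answer("Option B sounds right.", options))  # -> None (unparseable)
```

Requiring a machine-parseable reply also makes scoring unambiguous: an answer either names a valid option key or is counted as incorrect, with no free-form text to grade.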