PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
October 27, 2025
Authors: Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
cs.AI
Abstract
We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art multimodal large language models (MLLMs) reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
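To make the diagnostic task concrete, the sketch below shows one plausible way to represent a benchmark item and score first-error identification by exact match. The field names (`image_path`, `cot_steps`, `first_error_step`), the 1-based step indexing, and the scoring rule are illustrative assumptions, not the paper's released data format or official metric.

```python
# Hypothetical sketch of the first-error-identification protocol described in the
# abstract; field names and the exact-match scoring rule are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class PrismItem:
    """One benchmark item: a visual puzzle plus a CoT containing exactly one flawed step."""
    image_path: str        # path to the puzzle image
    cot_steps: List[str]   # step-by-step chain-of-thought (steps indexed from 1 below)
    first_error_step: int  # gold 1-based index of the first incorrect step


def score_first_error_detection(items: List[PrismItem],
                                predictions: List[int]) -> float:
    """Exact-match accuracy: a prediction counts only if it names the gold step index."""
    assert len(items) == len(predictions)
    correct = sum(int(pred == item.first_error_step)
                  for item, pred in zip(items, predictions))
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    demo = [PrismItem("puzzle_001.png",
                      ["Step 1: count the shaded cells in the grid",
                       "Step 2: the count is 7 (incorrect step)",
                       "Step 3: therefore the answer is B"],
                      first_error_step=2)]
    print(score_first_error_detection(demo, predictions=[2]))  # 1.0
```

Because each CoT contains exactly one error, exact match on the step index is a natural scoring choice here, but the benchmark's actual protocol may differ.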