

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

October 27, 2025
Authors: Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
cs.AI

Abstract

We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
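To make the diagnostic task concrete, the sketch below shows one way an error-localization benchmark item could be represented and scored: each item pairs a visual puzzle with a chain-of-thought containing exactly one flawed step, and a model is credited only if it pinpoints the first incorrect step. The `PuzzleItem` schema, field names, and scoring function are illustrative assumptions, not PRISM-Bench's actual data format or official evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PuzzleItem:
    """Hypothetical benchmark item: a visual puzzle plus a CoT with exactly one flawed step."""
    image_path: str        # path to the puzzle image (placeholder)
    question: str          # the puzzle prompt
    cot_steps: List[str]   # step-by-step chain-of-thought containing exactly one error
    first_error_step: int  # 1-based index of the first incorrect step (ground truth)

def score_error_localization(items: List[PuzzleItem], predictions: List[int]) -> float:
    """Fraction of items where the predicted first-error step matches the ground truth."""
    assert len(items) == len(predictions)
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred == item.first_error_step
    )
    return correct / len(items) if items else 0.0

if __name__ == "__main__":
    # Toy example (illustrative only): the error is introduced at step 2.
    item = PuzzleItem(
        image_path="puzzle_001.png",
        question="Which option completes the 3x3 analogy grid?",
        cot_steps=[
            "Step 1: Each row adds one shaded cell from left to right.",
            "Step 2: The second row therefore ends with four shaded cells.",  # flawed step
            "Step 3: The answer is the option with four shaded cells.",
        ],
        first_error_step=2,
    )
    print(score_error_localization([item], [2]))  # prints 1.0
```

Under this kind of protocol, answer generation and reasoning verification are scored separately, which is what lets the benchmark expose models that produce fluent chains-of-thought yet cannot locate a simple logical fault.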