CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Abstract

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer and offer limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and a benchmark, respectively, both built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays: segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation, including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of the 10 evaluated LVLMs struggles with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench.
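To make the kind of intermediate step described above concrete, the following is a minimal sketch of one measurement, index, and threshold chain. It is an illustration only, not the CheXStruct implementation. It assumes the cardiothoracic ratio (CTR) as the diagnostic index and 0.5 as the threshold, a common convention for PA radiographs that the abstract does not itself state. The masks are toy data, and every function name here is hypothetical.

```python
from typing import List, Tuple

# Binary segmentation mask: 1 = pixel inside the region, 0 = background.
Mask = List[List[int]]


def horizontal_extent(mask: Mask) -> Tuple[int, int]:
    """Return the leftmost and rightmost column indices that contain a 1."""
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not cols:
        raise ValueError("mask is empty")
    return min(cols), max(cols)


def max_width(mask: Mask) -> int:
    """Widest horizontal span of the region, in pixels (the measurement step)."""
    left, right = horizontal_extent(mask)
    return right - left + 1


def cardiothoracic_ratio(heart: Mask, thorax: Mask) -> float:
    """Heart width divided by thoracic width (the index step)."""
    return max_width(heart) / max_width(thorax)


def classify(ctr: float, threshold: float = 0.5) -> str:
    """Apply the assumed clinical threshold (the decision step)."""
    return "enlarged" if ctr > threshold else "normal"


# Toy masks standing in for segmentation output (hypothetical data).
heart = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
]
thorax = [
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]

ctr = cardiothoracic_ratio(heart, thorax)
label = classify(ctr)
print(f"CTR = {ctr:.2f} -> {label}")
```

Keeping measurement, index, and threshold as separate functions mirrors the idea of checking each reasoning stage on its own: a model's answer could be compared against the reference at any of the three points rather than only at the final label.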