CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Abstract

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer and offer limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct, a structured pipeline, and CXReasonBench, a benchmark, both built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays: segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench uses this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation, including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of the 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench.
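To make the intermediate steps concrete (segmentation, measurement, index, threshold), the following is a minimal sketch and not the CheXStruct implementation. It assumes the cardiothoracic ratio (CTR) as an illustrative index, since the abstract does not name specific indices, and it assumes binary heart and lung masks are already available. The function names, the toy masks, and the use of overall horizontal extent as the diameter measure are all hypothetical simplifications; the 0.5 cutoff is the conventional clinical threshold for PA views.

```python
import numpy as np


def horizontal_extent(mask: np.ndarray) -> int:
    """Overall horizontal extent (in pixels) of a binary mask.

    Simplification: measures the span between the leftmost and rightmost
    occupied columns, not the width along a single image row.
    """
    cols = np.flatnonzero(mask.any(axis=0))
    return 0 if cols.size == 0 else int(cols[-1] - cols[0] + 1)


def cardiothoracic_ratio(heart_mask: np.ndarray, lung_mask: np.ndarray) -> float:
    """Diagnostic index: cardiac width divided by thoracic width."""
    thorax_width = horizontal_extent(lung_mask)
    if thorax_width == 0:
        raise ValueError("lung mask is empty; cannot compute the ratio")
    return horizontal_extent(heart_mask) / thorax_width


def apply_threshold(ctr: float, threshold: float = 0.5) -> str:
    """Clinical threshold step: CTR above 0.5 is conventionally read as enlarged."""
    return "enlarged" if ctr > threshold else "normal"


# Toy 10x20 "segmentations" standing in for real model output (hypothetical data).
lung_mask = np.zeros((10, 20), dtype=bool)
lung_mask[1:9, 2:18] = True    # thorax spans 16 columns
heart_mask = np.zeros((10, 20), dtype=bool)
heart_mask[4:9, 5:15] = True   # heart spans 10 columns

ctr = cardiothoracic_ratio(heart_mask, lung_mask)
label = apply_threshold(ctr)
print(f"CTR = {ctr:.3f} -> {label}")
```

Each function corresponds to one stage a model could be questioned on separately (where the structures are, what was measured, what ratio follows, what the threshold implies), which is the kind of stage-wise evaluation the abstract describes.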