CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

Abstract

Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and a benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays: segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench uses this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation, including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of the 10 evaluated LVLMs struggles with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench.

