ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Abstract

Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram (ECG) interpretation, but it remains unclear whether they genuinely perform step-by-step reasoning or merely rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples that systematically assesses step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although the models possess the medical knowledge needed to retrieve the clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily because they fail to ground the corresponding ECG findings in the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the need for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.
