

ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

March 15, 2026
Authors: Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform step-by-step reasoning or merely rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings in the actual visual evidence of the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the necessity for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.
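The "Completion" figure above is an all-or-nothing measure over multi-turn reasoning chains: a sample counts only if every step is judged correct. The sketch below illustrates how such a metric could be computed; the function name and data layout are illustrative assumptions, not the benchmark's actual implementation.

```python
def chain_completion(results):
    """Hypothetical all-or-nothing chain metric.

    results: list of reasoning chains, each a list of per-step booleans
    (True = the model's step was judged correct). A chain counts as
    complete only if every step in it is correct.
    """
    if not results:
        return 0.0
    complete = sum(1 for chain in results if chain and all(chain))
    return complete / len(results)

# Example: only 1 of 4 chains is correct end to end.
chains = [
    [True, True, True],    # complete chain
    [True, False, True],   # breaks at step 2
    [True, True, False],   # breaks at the final step
    [False, True, True],   # breaks at step 1
]
print(chain_completion(chains))  # → 0.25
```

Under this kind of scoring, a model can be right at most individual steps yet still score near zero, which is consistent with the gap the abstract reports between per-criterion knowledge and full-chain completion.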