ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
March 15, 2026
Authors: Jungwoo Oh, Hyunseung Chung, Junhee Lee, Min-Gyu Kim, Hangyul Yoon, Ki Seong Lee, Youngchae Lee, Muhan Yeo, Edward Choi
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform step-by-step reasoning or merely rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples that systematically assesses step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve the clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings in the actual visual evidence of the ECG signal. These results demonstrate that current MLLMs bypass genuine visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the need for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.
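The abstract's "Completion" metric counts a sample as successful only when the entire multi-step reasoning chain holds. A minimal sketch of such a metric is shown below; the data layout and field names (`steps`, `correct`) are illustrative assumptions, not the benchmark's actual schema.

```python
def completion_rate(samples):
    """Fraction of samples whose full reasoning chain is correct.

    Each sample is a dict with a "steps" list; a chain counts as
    complete only if every step in it is marked correct.
    (Hypothetical format, not the benchmark's real schema.)
    """
    if not samples:
        return 0.0
    complete = sum(
        1 for s in samples
        if all(step["correct"] for step in s["steps"])
    )
    return complete / len(samples)


# Example: two samples, only one with a fully correct chain.
samples = [
    {"steps": [{"correct": True}, {"correct": True}]},   # complete chain
    {"steps": [{"correct": True}, {"correct": False}]},  # breaks at step 2
]
```

Under this all-or-nothing definition, a model can be right at most steps yet still score near zero, which is consistent with the gap the paper reports between criterion retrieval and chain completion.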