ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Abstract

While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram (ECG) interpretation, it remains unclear whether they genuinely perform step-by-step reasoning or merely rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples that systematically assesses step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although the models possess the medical knowledge to retrieve the clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily because they fail to ground the corresponding ECG findings in the actual visual evidence in the ECG signal. These results demonstrate that current MLLMs bypass actual visual interpretation, exposing a critical flaw in existing training paradigms and underscoring the need for robust, reasoning-centric medical AI. The code and data are available at https://github.com/Jwoo5/ecg-reasoning-benchmark.
