ClinHallu: A Benchmark for Diagnosing Stage-wise Hallucinations in Medical MLLM Reasoning

Abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection and often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each augmented with a structured reasoning trace decomposed into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting a specific stage affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.
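
The stage-replacement idea can be illustrated with a minimal Python sketch. This is not the benchmark's actual code or data schema: the `Trace` class, its field names, the `stage_replacement_effects` helper, and the toy `toy_answerer` function are all hypothetical names introduced here for illustration. The sketch assumes a model's final answer can be re-derived from a three-stage trace, swaps in one reference stage at a time, and records whether the answer becomes correct.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

# Hypothetical stage names mirroring the three stages named in the abstract.
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


@dataclass(frozen=True)
class Trace:
    """Hypothetical container for a three-stage reasoning trace."""
    visual_recognition: str
    knowledge_recall: str
    reasoning_integration: str


def stage_replacement_effects(
    model_trace: Trace,
    reference_trace: Trace,
    answer_from_trace: Callable[[Trace], str],
    gold_answer: str,
) -> Dict[str, bool]:
    """Swap in one reference stage at a time and check the final answer."""
    effects = {}
    for stage in STAGES:
        # Replace only the current stage; the other two stay as the model wrote them.
        patched = replace(model_trace, **{stage: getattr(reference_trace, stage)})
        effects[stage] = answer_from_trace(patched) == gold_answer
    return effects


def toy_answerer(trace: Trace) -> str:
    """Toy stand-in for a model (assumption): the answer depends only on the visual stage."""
    return "pneumonia" if "consolidation" in trace.visual_recognition else "normal"


# Toy example: the model's trace contains an error in the visual stage only.
model_trace = Trace(
    visual_recognition="clear lung fields",
    knowledge_recall="consolidation suggests pneumonia",
    reasoning_integration="no finding, so the study is normal",
)
reference_trace = Trace(
    visual_recognition="right lower lobe consolidation",
    knowledge_recall="consolidation suggests pneumonia",
    reasoning_integration="consolidation is present, so pneumonia is likely",
)

effects = stage_replacement_effects(model_trace, reference_trace, toy_answerer, "pneumonia")
print(effects)
```

In this toy case only the visual-stage replacement fixes the answer, which is the kind of signal that would attribute the hallucination to visual misrecognition. In practice the answer function would be a call to the MLLM under evaluation, conditioned on the patched trace.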