ClinHallu: A Benchmark for Diagnosing Stage-wise Hallucinations in Medical MLLM Reasoning

Abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection and often ignore where in the reasoning process a hallucination originates. We find that hallucination sources vary across samples: an error may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each paired with a structured reasoning trace decomposed into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting a specific stage affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.
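To make the stage-replacement idea concrete, the following is a minimal sketch and not the benchmark's actual implementation. It assumes a trace can be stored as three text fields and that a final answer can be recomputed from a patched trace. The names `ReasoningTrace`, `stage_replacement_effects`, and `toy_answer_fn`, as well as the example texts, are hypothetical and were invented for illustration; they are not taken from the ClinHallu repository.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

# The three stages named in the abstract, as hypothetical field names.
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical container for one stage-wise reasoning trace."""
    visual_recognition: str
    knowledge_recall: str
    reasoning_integration: str


def stage_replacement_effects(
    model_trace: ReasoningTrace,
    reference_trace: ReasoningTrace,
    answer_fn: Callable[[ReasoningTrace], str],
    gold_answer: str,
) -> Dict[str, bool]:
    """Swap in the reference text one stage at a time.

    Returns, per stage, whether the final answer is correct after
    replacing only that stage of the model's trace.
    """
    effects = {}
    for stage in STAGES:
        patched = replace(model_trace, **{stage: getattr(reference_trace, stage)})
        effects[stage] = answer_fn(patched) == gold_answer
    return effects


def toy_answer_fn(trace: ReasoningTrace) -> str:
    """Stand-in for re-querying a model (assumption): the answer here
    depends only on what the visual stage reports."""
    return "pneumonia" if "consolidation" in trace.visual_recognition else "normal"


# Invented toy data, not a benchmark instance.
model_trace = ReasoningTrace(
    visual_recognition="clear lung fields",
    knowledge_recall="opacity in a lobe can indicate infection",
    reasoning_integration="no abnormal finding, so the study is normal",
)
reference_trace = ReasoningTrace(
    visual_recognition="right lower lobe consolidation",
    knowledge_recall="opacity in a lobe can indicate infection",
    reasoning_integration="the finding matches an infectious process",
)

effects = stage_replacement_effects(
    model_trace, reference_trace, toy_answer_fn, gold_answer="pneumonia"
)
print(effects)
```

In this toy setup only the visual stage replacement fixes the answer, which is the kind of signal that would point to visual misrecognition as the error source. A real evaluation would replace `toy_answer_fn` with a call that conditions the model under test on the patched trace.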