ClinHallu: A Benchmark for Diagnosing Stage-wise Hallucinations in Medical MLLM Reasoning

Abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks focus mainly on data collection and often ignore where in the reasoning process a hallucination originates. We find that hallucination sources vary across samples: an error may arise from visual misrecognition, incorrect recall of medical knowledge, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, each augmented with a structured reasoning trace decomposed into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting a specific stage affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.
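The stage-replacement idea can be illustrated with a minimal sketch. It is not the ClinHallu implementation: the `Trace` class, the stage field names, `replace_stage`, `recovery_by_stage`, and the toy `toy_answer` function are all hypothetical names chosen for illustration, and the sketch assumes that an intervention swaps one stage of the model's trace for a reference version and then checks whether the final answer becomes correct.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict

# Hypothetical stage names mirroring the three stages named in the abstract.
STAGES = ("visual_recognition", "knowledge_recall", "reasoning_integration")


@dataclass(frozen=True)
class Trace:
    """A structured reasoning trace with one text field per stage (assumed layout)."""
    visual_recognition: str
    knowledge_recall: str
    reasoning_integration: str


def replace_stage(model_trace: Trace, reference_trace: Trace, stage: str) -> Trace:
    """Return a copy of the model trace with one stage taken from the reference."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return replace(model_trace, **{stage: getattr(reference_trace, stage)})


def recovery_by_stage(
    model_trace: Trace,
    reference_trace: Trace,
    gold_answer: str,
    answer_fn: Callable[[Trace], str],
) -> Dict[str, bool]:
    """For each stage, report whether correcting only that stage yields the gold answer."""
    return {
        stage: answer_fn(replace_stage(model_trace, reference_trace, stage)) == gold_answer
        for stage in STAGES
    }


def toy_answer(trace: Trace) -> str:
    """Stand-in for a model call: the answer depends only on the visual stage."""
    return "pneumonia" if "consolidation" in trace.visual_recognition else "normal"


model = Trace(
    visual_recognition="clear lung fields",
    knowledge_recall="consolidation suggests pneumonia",
    reasoning_integration="no abnormal finding, so the study is normal",
)
reference = Trace(
    visual_recognition="right lower lobe consolidation",
    knowledge_recall="consolidation suggests pneumonia",
    reasoning_integration="consolidation is present, so pneumonia is likely",
)

result = recovery_by_stage(model, reference, "pneumonia", toy_answer)
print(result)
```

In this toy case only the visual stage recovers the correct answer when replaced, which would point to visual misrecognition as the source of the error. In practice `answer_fn` would wrap a call to the MLLM under evaluation.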