
S-Chain: Structured Visual Chain-of-Thought For Medicine

October 26, 2025
Authors: Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
cs.AI

Abstract
Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
December 1, 2025