S-Chain: Structured Visual Chain-of-Thought For Medicine
October 26, 2025
Authors: Khai Le-Duc, Duy M. H. Nguyen, Phuong T. H. Trinh, Tien-Phat Nguyen, Nghiem T. Diep, An Ngo, Tung Vu, Trinh Vuong, Anh-Tien Nguyen, Mau Nguyen, Van Trung Hoang, Khai-Nguyen Nguyen, Hy Nguyen, Chris Ngo, Anji Liu, Nhat Ho, Anne-Christin Hauschild, Khanh Xuan Nguyen, Thanh Nguyen-Tang, Pengtao Xie, Daniel Sonntag, James Zou, Mathias Niepert, Anh Totti Nguyen
cs.AI
Abstract
Faithful reasoning in medical vision-language models (VLMs) requires not only
accurate predictions but also transparent alignment between textual rationales
and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise
in medical visual question answering (VQA), no large-scale expert-level dataset
has captured stepwise reasoning with precise visual grounding. We introduce
S-Chain, the first large-scale dataset of 12,000 expert-annotated medical
images with bounding boxes and structured visual CoT (SV-CoT), explicitly
linking visual regions to reasoning steps. The dataset further supports 16
languages, totaling over 700k VQA pairs for broad multilingual applicability.
Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med,
LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that
SV-CoT supervision significantly improves interpretability, grounding fidelity,
and robustness. Beyond benchmarking, we study its synergy with
retrieval-augmented generation, revealing how domain knowledge and visual
grounding interact during autoregressive reasoning. Finally, we propose a new
mechanism that strengthens the alignment between visual evidence and reasoning,
improving both reliability and efficiency. S-Chain establishes a new benchmark
for grounded medical reasoning and paves the way toward more trustworthy and
explainable medical VLMs.