Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
December 3, 2025
Authors: Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes
cs.AI
Abstract
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy: evolving from multimodal understanding to clinical reasoning. (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model for colonoscopy, incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
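To make the "task-adaptive rewarding" idea concrete, the sketch below shows one common way an R1-style reward can adapt to the task type: exact match for closed-ended questions versus token-overlap F1 for open-ended ones, combined with a format term for structured `<think>`/`<answer>` outputs. This is a minimal illustrative sketch only; the function names, task categories, and weights are assumptions, not the actual Colon-X implementation.

```python
# Illustrative sketch of a task-adaptive reward for R1-style training.
# All names, task types, and weights here are hypothetical assumptions,
# not taken from the Colon-X codebase.

def format_reward(response: str) -> float:
    """1.0 if the response uses the expected <think>/<answer> structure."""
    tags = ("<think>", "</think>", "<answer>", "</answer>")
    return 1.0 if all(t in response for t in tags) else 0.0

def accuracy_reward(response: str, gold: str, task: str) -> float:
    """Task-adaptive correctness: exact match for closed-ended tasks
    (e.g. multiple choice), token-level F1 for open-ended answers."""
    answer = response.split("<answer>")[-1].split("</answer>")[0].strip().lower()
    gold = gold.strip().lower()
    if task == "closed":
        return 1.0 if answer == gold else 0.0
    # Open-ended: token-overlap F1 between prediction and reference.
    a, g = set(answer.split()), set(gold.split())
    overlap = len(a & g)
    if not a or not g or overlap == 0:
        return 0.0
    p, r = overlap / len(a), overlap / len(g)
    return 2 * p * r / (p + r)

def total_reward(response: str, gold: str, task: str,
                 w_fmt: float = 0.2, w_acc: float = 0.8) -> float:
    """Weighted sum of format and task-adaptive accuracy terms."""
    return (w_fmt * format_reward(response)
            + w_acc * accuracy_reward(response, gold, task))
```

In a GRPO-style loop, this scalar would score each sampled rollout before advantages are computed; the task-adaptive branch keeps closed-ended tasks strict while still giving partial credit on open-ended ones.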