A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
March 28, 2026
Authors: Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe
cs.AI
Abstract
Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks -- including multimodal data integration, human interaction, and physical effects -- generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short on the seemingly simple task of tool detection in neurosurgery. Additionally, we present scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.