A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Abstract

Recent artificial intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks, including multimodal data integration, human interaction, and physical effects, generally capable AI models could be particularly attractive as collaborative tools if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated each year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs leave it uncertain whether, and to what extent, modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, our scaling experiments indicate that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
