A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Abstract

Recent artificial intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but they have lagged behind on surgical image-analysis benchmarks. Because surgery requires integrating disparate tasks, including multimodal data integration, human interaction, and physical effects, generally capable AI models could be particularly attractive as collaborative tools if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether, and to what extent, modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current vision-language models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we present scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and propose potential solutions.
