A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Because surgery requires integrating disparate tasks, including multimodal data integration, human interaction, and physical effects, generally capable AI models could be particularly attractive as a collaborative tool if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether, and to what extent, modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current vision-language models fall short on the seemingly simple task of tool detection in neurosurgery. Additionally, we present scaling experiments indicating that increasing model size and training time yields only diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

