Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet the development of robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets focus predominantly on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations needed to evaluate modern Multimodal Large Language Models (MLLMs). To address this gap, we introduce Colon-Bench, a benchmark generated via a novel multi-stage agentic workflow. Our pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to annotate full-procedure videos at scale. The resulting verified benchmark is unprecedented in scope: it comprises 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to evaluate state-of-the-art MLLMs on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLMs show surprisingly strong localization performance in the medical domain compared with SAM-3. Finally, we analyze common VQA errors made by MLLMs and use them to introduce a novel "colon-skill" prompting strategy, which improves zero-shot performance by up to 9.7% for most of the MLLMs evaluated. The dataset and code are available at https://abdullahamdi.com/colon-bench.
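
To make the four-stage structure of the workflow concrete, the following is a minimal Python sketch of how temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review could be chained. It is an illustration under stated assumptions, not the released Colon-Bench code: the names `LesionCandidate`, `run_pipeline`, `propose_windows`, `track_boxes`, `ai_confirm`, and `human_review` are hypothetical, and each stage body is a toy stand-in for the real model or reviewer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class LesionCandidate:
    """One candidate lesion event in a video (hypothetical record layout)."""
    label: str                      # e.g. "polyp", "ulcer", "bleeding"
    start_frame: int
    end_frame: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box
    ai_confirmed: bool = False
    human_approved: bool = False


Stage = Callable[[List[LesionCandidate]], List[LesionCandidate]]


def propose_windows() -> List[LesionCandidate]:
    """Toy stand-in for temporal proposals: returns hard-coded windows."""
    return [
        LesionCandidate("polyp", 120, 150),
        LesionCandidate("bleeding", 400, 401),
    ]


def track_boxes(candidates: List[LesionCandidate]) -> List[LesionCandidate]:
    """Toy stand-in for tracking: attach a placeholder box to every frame."""
    for c in candidates:
        c.boxes = {f: (0, 0, 10, 10) for f in range(c.start_frame, c.end_frame + 1)}
    return candidates


def ai_confirm(candidates: List[LesionCandidate]) -> List[LesionCandidate]:
    """Toy stand-in for AI confirmation: keep windows of at least 3 frames."""
    kept = [c for c in candidates if c.end_frame - c.start_frame + 1 >= 3]
    for c in kept:
        c.ai_confirmed = True
    return kept


def human_review(candidates: List[LesionCandidate]) -> List[LesionCandidate]:
    """Toy stand-in for human review: approve everything that reaches it."""
    for c in candidates:
        c.human_approved = True
    return candidates


def run_pipeline(candidates: List[LesionCandidate],
                 stages: List[Stage]) -> List[LesionCandidate]:
    """Apply each stage in order; every stage may filter or enrich candidates."""
    for stage in stages:
        candidates = stage(candidates)
    return candidates


if __name__ == "__main__":
    verified = run_pipeline(propose_windows(),
                            [track_boxes, ai_confirm, human_review])
    for c in verified:
        print(c.label, c.start_frame, c.end_frame, len(c.boxes))
```

Keeping each stage as a function from a candidate list to a candidate list means automated filters run before the human reviewer, so reviewer time is spent only on candidates that survived the earlier stages. Whether the actual workflow is organized this way is an assumption.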