Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet the development of robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets focus predominantly on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to annotate full-procedure videos at scale. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to evaluate state-of-the-art MLLMs on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLMs show surprisingly high localization performance in the medical domain compared to SAM-3. Finally, we analyze common VQA errors made by MLLMs and introduce a novel "colon-skill" prompting strategy that improves zero-shot performance by up to 9.7% for most of the MLLMs evaluated. The dataset and code are available at https://abdullahamdi.com/colon-bench.
