Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We use Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, which improves zero-shot performance by up to 9.7% for most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench.
