AutoMedBench: Toward Automated Medical Research with Agentic AI Models

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks and short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs and offer limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks. AutoMedBench organizes agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. Its tasks are long-horizon, with each run averaging 33 agent turns, and span five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding. Each run is scored on both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring shows that Validate is the weakest workflow stage on average and Setup the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes, respectively, whereas task-understanding errors are rare at 0.9%. On average, runs with one fired error code score 48% lower overall than runs with no error code.
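
As a rough illustration of the stage-level analysis described above, the following Python sketch averages S1-S5 scores over a set of runs, identifies the weakest and strongest stages, and computes the relative drop in overall score for runs with exactly one fired error code. The record layout, the field names (`stages`, `overall`, `error_codes`), the error-code labels, and all numeric values are assumptions made for this example; they are not taken from the benchmark and do not reproduce its reported results.

```python
from statistics import mean

# Stage keys follow the S1-S5 workflow named in the abstract.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")

# Hypothetical run records: every value here is made up for illustration.
runs = [
    {"stages": dict(zip(STAGES, (0.8, 0.9, 0.4, 0.7, 0.6))),
     "overall": 0.8, "error_codes": []},
    {"stages": dict(zip(STAGES, (0.7, 1.0, 0.5, 0.8, 0.7))),
     "overall": 0.8, "error_codes": []},
    {"stages": dict(zip(STAGES, (0.6, 0.8, 0.2, 0.5, 0.3))),
     "overall": 0.4, "error_codes": ["E_VALIDATE"]},  # hypothetical label
    {"stages": dict(zip(STAGES, (0.7, 0.9, 0.3, 0.6, 0.2))),
     "overall": 0.4, "error_codes": ["E_SUBMIT"]},    # hypothetical label
]


def mean_stage_scores(records):
    """Average each stage score across all runs."""
    return {s: mean(r["stages"][s] for r in records) for s in STAGES}


def weakest_and_strongest(stage_means):
    """Return the stage names with the lowest and highest mean score."""
    weakest = min(stage_means, key=stage_means.get)
    strongest = max(stage_means, key=stage_means.get)
    return weakest, strongest


def relative_score_drop(records):
    """Relative drop in mean overall score: one fired code vs. none."""
    clean = [r["overall"] for r in records if not r["error_codes"]]
    flagged = [r["overall"] for r in records if len(r["error_codes"]) == 1]
    return (mean(clean) - mean(flagged)) / mean(clean)


stage_means = mean_stage_scores(runs)
weakest, strongest = weakest_and_strongest(stage_means)
drop = relative_score_drop(runs)
print(f"weakest={weakest}, strongest={strongest}, drop={drop:.0%}")
```

The relative drop is computed here as (mean clean score - mean flagged score) / mean clean score, which is one plausible reading of "48% lower"; the benchmark's own aggregation may differ.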