AutoMedBench: Toward Automated Medical Research with Agentic AI Models

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks and short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs and offer limited visibility into how agents behave during the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research that covers diverse medical imaging and multimodal inference tasks and organizes agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. The benchmark comprises long-horizon tasks, with each run averaging 33 agent turns, and spans five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding. Each run is scored on both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring shows that Validate is the weakest workflow stage on average and Setup the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes, respectively, whereas task-understanding errors are rare at 0.9%. On average, runs with one fired error code score 48% lower overall than runs with no error code.
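To make the stage-level analysis concrete, the following is a minimal sketch of how per-stage scores might be aggregated across runs to find the weakest and strongest stage. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring code: the `RunRecord` type, its field names, the stage keys, the [0, 1] score range, and the toy values are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical stage keys mirroring the S1-S5 workflow named in the abstract.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")


@dataclass
class RunRecord:
    """One recorded agent run (assumed layout, for illustration only)."""
    track: str                      # e.g. "segmentation", "vqa"
    tier: str                       # "lite" or "standard"
    stage_scores: Dict[str, float]  # assumed to lie in [0, 1]
    error_codes: List[str] = field(default_factory=list)


def mean_stage_scores(runs: List[RunRecord]) -> Dict[str, float]:
    """Average each stage's score over all runs."""
    return {s: mean(r.stage_scores[s] for r in runs) for s in STAGES}


def weakest_and_strongest(runs: List[RunRecord]) -> Tuple[str, str]:
    """Return the stage keys with the lowest and highest mean score."""
    means = mean_stage_scores(runs)
    return min(means, key=means.get), max(means, key=means.get)


# Toy values invented for this example; they are not results from the benchmark.
TOY_RUNS = [
    RunRecord("segmentation", "lite",
              {"S1_plan": 0.8, "S2_setup": 0.9, "S3_validate": 0.4,
               "S4_inference": 0.7, "S5_submit": 0.6}),
    RunRecord("vqa", "standard",
              {"S1_plan": 0.6, "S2_setup": 1.0, "S3_validate": 0.5,
               "S4_inference": 0.8, "S5_submit": 0.7},
              error_codes=["E_HYPOTHETICAL"]),
]

if __name__ == "__main__":
    print(weakest_and_strongest(TOY_RUNS))
```

Keeping stage scores keyed by stage name, rather than as a positional list, makes it straightforward to group the same aggregation by track or tier if those breakdowns are needed.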