AutoMedBench: Automated Medical Research for Agentic AI Models

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows rather than isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs and offer limited visibility into agent behavior during the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research that covers diverse medical imaging and multimodal inference tasks and organizes agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. The benchmark comprises long-horizon tasks, with each run averaging 33 agent turns, across five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding. Each run is scored on both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring shows that Validate is the weakest workflow stage on average and Setup the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired error codes, respectively, whereas task-understanding errors are rare at 0.9%. On average, runs with one fired error code score 48% lower overall than runs with no error code.
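
The stage-level analysis summarized above can be illustrated with a minimal sketch. It is not the benchmark's actual code: the record layout, the stage keys, the error-code label, and the assumption that all scores lie in [0, 1] are hypothetical and chosen only for illustration. The sketch averages S1-S5 scores across runs to find the weakest and strongest stages, and compares the mean overall score of runs with and without fired error codes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical stage keys mirroring the S1-S5 workflow named in the abstract.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")


@dataclass
class RunRecord:
    """One recorded agent run (illustrative layout, not the benchmark schema)."""
    overall_score: float              # assumed to lie in [0, 1]
    stage_scores: Dict[str, float]    # one score per key in STAGES
    error_codes: Tuple[str, ...] = () # codes fired in post-run analysis


def mean_stage_scores(runs: List[RunRecord]) -> Dict[str, float]:
    """Average each stage score over all runs."""
    return {s: mean(r.stage_scores[s] for r in runs) for s in STAGES}


def weakest_and_strongest(runs: List[RunRecord]) -> Tuple[str, str]:
    """Return the stages with the lowest and highest mean score."""
    means = mean_stage_scores(runs)
    return min(means, key=means.get), max(means, key=means.get)


def relative_score_drop(runs: List[RunRecord]) -> float:
    """Fractional drop in mean overall score for runs with any fired code."""
    flagged = [r.overall_score for r in runs if r.error_codes]
    clean = [r.overall_score for r in runs if not r.error_codes]
    return 1.0 - mean(flagged) / mean(clean)


# Toy data, invented purely to exercise the functions.
example_runs = [
    RunRecord(0.8, dict(zip(STAGES, (0.9, 1.0, 0.5, 0.8, 0.7)))),
    RunRecord(0.4, dict(zip(STAGES, (0.7, 0.9, 0.2, 0.6, 0.3))),
              error_codes=("E_VALIDATE",)),
]
```

Calling `weakest_and_strongest(example_runs)` on the toy data picks out the Validate and Setup keys, and `relative_score_drop(example_runs)` gives the fractional gap between flagged and clean runs. Note that this helper groups every run with any fired code together, which is a simplification chosen for this sketch.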