AutoMedBench: Towards Automated Medical Research with Agentic AI Models

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows rather than isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, which gives limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks. AutoMedBench organizes agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks, with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding. Each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%. On average, runs with one fired error code score 48% lower overall than runs with no error code.
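
The three summary statistics reported above (per-stage averages, each error category's share of fired codes, and the relative score gap between runs with and without errors) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the stage keys, the error-code names, and the toy values are hypothetical and are not taken from AutoMedBench or its data, and the gap is computed for runs with any fired code rather than exactly one.

```python
from collections import Counter
from statistics import mean

# Hypothetical stage keys mirroring the S1-S5 workflow named in the abstract.
STAGES = ("S1_plan", "S2_setup", "S3_validate", "S4_inference", "S5_submit")

# Hypothetical toy runs; the numbers are illustrative, not benchmark results.
runs = [
    {"stages": dict(zip(STAGES, (0.8, 0.9, 0.4, 0.7, 0.6))),
     "overall": 0.8, "errors": []},
    {"stages": dict(zip(STAGES, (0.7, 0.9, 0.3, 0.6, 0.5))),
     "overall": 0.4, "errors": ["validate"]},
    {"stages": dict(zip(STAGES, (0.9, 1.0, 0.5, 0.8, 0.4))),
     "overall": 0.4, "errors": ["validate", "submit"]},
]


def stage_means(records):
    """Average each stage score across all runs."""
    return {s: mean(r["stages"][s] for r in records) for s in STAGES}


def error_shares(records):
    """Fraction of all fired error codes that falls in each category."""
    counts = Counter(code for r in records for code in r["errors"])
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def relative_score_gap(records):
    """Relative drop in mean overall score for runs with any fired code."""
    with_err = [r["overall"] for r in records if r["errors"]]
    without_err = [r["overall"] for r in records if not r["errors"]]
    return 1 - mean(with_err) / mean(without_err)


means = stage_means(runs)
weakest = min(means, key=means.get)
strongest = max(means, key=means.get)
shares = error_shares(runs)
gap = relative_score_gap(runs)
```

In this sketch the weakest and strongest stages are read off the per-stage means, which is one plausible way to arrive at a statement such as "Validate is weakest, Setup is strongest"; the benchmark's actual aggregation may differ.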