CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Abstract

Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent work has focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this gap, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, electronic health record (EHR) systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes the action or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to make more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38% on our benchmark and an out-of-distribution dataset, respectively.
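
The Actor and Critic interaction described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the CarePilot implementation: the names `DualMemory`, `Verdict`, `ToyEnv`, `run_workflow`, `toy_actor`, and `toy_critic` are hypothetical, the "interface state" is a tuple of already executed action names rather than a screenshot, and both agents are hand-written rules rather than vision-language models.

```python
from collections import deque
from dataclasses import dataclass, field

PLAN = ["open_viewer", "load_study", "export_report"]  # hypothetical workflow
DONE = "DONE"


@dataclass
class DualMemory:
    """Assumed dual memory: full history plus a short recent window."""
    long_term: list = field(default_factory=list)
    short_term: deque = field(default_factory=lambda: deque(maxlen=3))

    def record(self, note: str) -> None:
        self.long_term.append(note)
        self.short_term.append(note)  # deque drops the oldest entry itself


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


class ToyEnv:
    """Stand-in for the software under control (assumption: no real GUI)."""
    def __init__(self):
        self.executed = []

    def observe(self):
        return tuple(self.executed)

    def execute(self, action: str) -> None:
        self.executed.append(action)


def toy_actor(goal, state, memory, feedback):
    """Proposes the next action; deliberately skips one step once."""
    done = len(state)
    if done == len(PLAN):
        return DONE
    if feedback is None and done == 1:
        return PLAN[2]  # deliberate mistake so the Critic has work to do
    return PLAN[done]


def toy_critic(goal, state, action, memory):
    """Approves an action only if it is the next step of the plan."""
    expected = PLAN[len(state)]
    if action == expected:
        return Verdict(True)
    return Verdict(False, f"expected {expected}")


def run_workflow(goal, actor, critic, env, max_steps=20):
    memory = DualMemory()
    feedback = None
    for _ in range(max_steps):
        state = env.observe()
        action = actor(goal, state, memory, feedback)
        if action == DONE:
            break
        verdict = critic(goal, state, action, memory)
        if verdict.approved:
            env.execute(action)  # approved actions are executed
            memory.record(f"ok: {action}")
            feedback = None
        else:
            # rejected actions are not executed; feedback goes to the Actor
            memory.record(f"rejected: {action} ({verdict.feedback})")
            feedback = verdict.feedback
    return env.executed, memory


if __name__ == "__main__":
    executed, memory = run_workflow("export a report", toy_actor, toy_critic, ToyEnv())
    print(executed)
    print(list(memory.short_term))
```

In this toy run the Actor's skipped step is rejected, the corrective feedback is passed back on the next turn, and the workflow then completes in plan order. How CarePilot actually represents memory, grounds tools, or trains the Actor is not specified in the abstract and is not modeled here.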
