CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Abstract

Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, electronic health record (EHR) systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes the action or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to make more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% on our benchmark and 3.38% on an out-of-distribution dataset.
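The actor-critic loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`DualMemory`, `run_episode`, the toy actor, critic, and executor, and the `"DONE"` sentinel) is a hypothetical stand-in, the VLM calls are replaced by plain functions, and the memory is reduced to a bounded short-term buffer plus an append-only long-term log. Execution is also split out into a separate callback for readability, whereas the abstract assigns it to the Critic.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple


@dataclass
class DualMemory:
    """Hypothetical dual memory: recent steps plus a full experience log."""
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=5))
    long_term: List[str] = field(default_factory=list)

    def update(self, action: str, effect: str) -> None:
        entry = f"{action} -> {effect}"
        self.short_term.append(entry)  # oldest entries fall off automatically
        self.long_term.append(entry)


# Assumed signatures: the actor proposes an action, the critic returns
# (accepted, feedback), and the executor applies an accepted action.
Actor = Callable[[str, DualMemory, Optional[str]], str]
Critic = Callable[[str, str, DualMemory], Tuple[bool, Optional[str]]]
Executor = Callable[[str, str], str]


def run_episode(actor: Actor, critic: Critic, execute: Executor,
                initial_state: str, max_steps: int = 20) -> List[str]:
    """Run one workflow; return the list of actions that were executed."""
    memory = DualMemory()
    state = initial_state
    feedback: Optional[str] = None
    trace: List[str] = []
    for _ in range(max_steps):
        action = actor(state, memory, feedback)
        if action == "DONE":
            break
        accepted, feedback = critic(state, action, memory)
        if not accepted:
            continue  # the actor retries on the same state, with feedback
        state = execute(state, action)
        memory.update(action, state)  # record the observed effect
        trace.append(action)
    return trace


# Toy workflow, invented for illustration only.
PLAN = ["open_viewer", "load_study", "annotate"]


def toy_actor(state: str, memory: DualMemory, feedback: Optional[str]) -> str:
    step = len(memory.long_term)
    if step >= len(PLAN):
        return "DONE"
    if step == 1 and feedback is None:
        return "annotate"  # deliberately premature, so the critic rejects it
    return PLAN[step]


def toy_critic(state: str, action: str,
               memory: DualMemory) -> Tuple[bool, Optional[str]]:
    expected = PLAN[len(memory.long_term)]
    if action == expected:
        return True, None
    return False, f"expected {expected}"


def toy_execute(state: str, action: str) -> str:
    return f"after:{action}"


trace = run_episode(toy_actor, toy_critic, toy_execute, "start")
```

In this toy run the actor's premature `annotate` proposal is rejected once, the critic's feedback steers the retry, and only the three accepted actions reach the executor. The `max_steps` bound is a design choice of the sketch, added so that a critic that never accepts cannot loop forever.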