CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

March 25, 2026
Authors: Akash Ghosh, Tajamul Ashraf, Rishu Kumar Singh, Numan Saeed, Sriparna Saha, Xiuying Chen, Salman Khan
cs.AI

Abstract

Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes the action or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to make more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% on our benchmark and 3.38% on an out-of-distribution dataset.
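
The actor-critic loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Memory`, `Verdict`, `run_episode`, the toy actor, critic, and workflow) is hypothetical, and the VLM-based Actor and Critic are replaced by simple deterministic functions. It only shows the control flow under those assumptions: the actor proposes an action using the current state and dual memory, and the critic either accepts it for execution or returns corrective feedback.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Memory:
    """Hypothetical dual memory: a bounded short-term window and a full long-term log."""
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    long_term: List[str] = field(default_factory=list)

    def record(self, note: str) -> None:
        self.short_term.append(note)   # oldest entry is dropped once the window is full
        self.long_term.append(note)    # keeps every entry


@dataclass
class Verdict:
    accepted: bool
    feedback: Optional[str] = None


def run_episode(actor, critic, execute, state, goal_reached, max_steps=20):
    """Run one workflow: propose, evaluate, then execute or retry with feedback."""
    memory = Memory()
    executed = []
    feedback = None
    for _ in range(max_steps):
        if goal_reached(state):
            break
        action = actor(state, memory, feedback)
        verdict = critic(state, action)
        if verdict.accepted:
            state = execute(state, action)
            executed.append(action)
            memory.record(f"ok: {action}")
            feedback = None
        else:
            # Rejected actions are not executed; the feedback goes back to the actor.
            memory.record(f"rejected: {action} ({verdict.feedback})")
            feedback = verdict.feedback
    return executed, memory


# Toy stand-ins (assumptions): the state is an index into a fixed three-step workflow.
WORKFLOW = ["open_study", "select_series", "measure_lesion"]


def toy_actor(state, memory, feedback):
    # Deliberately proposes a wrong action once to exercise the feedback path.
    if state == 1 and feedback is None:
        return "measure_lesion"
    return WORKFLOW[state]


def toy_critic(state, action):
    expected = WORKFLOW[state]
    if action == expected:
        return Verdict(True)
    return Verdict(False, f"expected {expected}")


def toy_execute(state, action):
    return state + 1


if __name__ == "__main__":
    done = lambda s: s >= len(WORKFLOW)
    executed, memory = run_episode(toy_actor, toy_critic, toy_execute, 0, done)
    print(executed)
    print(list(memory.short_term))
```

In this toy run the critic rejects one proposal, so the long-term log holds four entries while the short-term window keeps only the last three. How CarePilot actually structures and updates its memories is not specified in the abstract.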