CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

Abstract

Multimodal agentic pipelines are transforming human-computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, electronic health record (EHR) systems, and laboratory information systems (LIS). On this benchmark, existing vision-language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor-critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms (long-term and short-term experience) to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes the action or provides corrective feedback to refine the workflow. Through iterative agentic simulation, the Actor learns to make more robust and reasoning-aware predictions during inference. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38%, respectively, on our benchmark and out-of-distribution dataset.
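The Actor and Critic interaction described in the abstract can be illustrated with a minimal sketch. It is not the CarePilot implementation: every name here (DualMemory, run_episode, the toy actor, critic, and step functions, and the "DONE" marker) is a hypothetical stand-in, and the models are replaced by hand-written rules so the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DualMemory:
    """Hypothetical dual memory (assumption, not the paper's data structure)."""
    short_term: List[str] = field(default_factory=list)  # current episode
    long_term: List[str] = field(default_factory=list)   # kept across episodes


def run_episode(actor, critic, step, observation, memory,
                max_steps=10, max_retries=2, done_marker="DONE"):
    """Propose, critique, then execute or retry with corrective feedback."""
    executed = []
    for _ in range(max_steps):
        feedback = None
        for _ in range(max_retries + 1):
            # Actor proposes the next semantic action from state and memory.
            action = actor(observation, memory, feedback)
            # Critic accepts the action or returns corrective feedback.
            accepted, feedback = critic(observation, action)
            if accepted:
                break
            memory.short_term.append(f"rejected {action}: {feedback}")
        else:
            break  # no acceptable action within the retry budget
        if action == done_marker:
            break
        # Execute the accepted action and record its observed effect.
        observation = step(observation, action)
        executed.append(action)
        memory.short_term.append(f"executed {action} -> {observation}")
    # Consolidate this episode's experience into long-term memory.
    memory.long_term.extend(memory.short_term)
    memory.short_term.clear()
    return executed


# Toy stand-ins for the learned models and the GUI (all hypothetical).
def toy_actor(observation, memory, feedback):
    if observation == "start":
        return "open_study" if feedback else "click_save"
    if observation == "study_open":
        return "click_save"
    return "DONE"


def toy_critic(observation, action):
    if observation == "start" and action == "click_save":
        return False, "open a study before saving"
    return True, ""


def toy_step(observation, action):
    return {"open_study": "study_open", "click_save": "saved"}[action]


memory = DualMemory()
executed = run_episode(toy_actor, toy_critic, toy_step, "start", memory)
print(executed)  # ['open_study', 'click_save']
```

In this toy run the Critic rejects the first proposal, the Actor revises it using the feedback, and the rejection plus both executed actions end up in long-term memory. How CarePilot actually represents memory, grounds tools, or trains the Actor is not specified in the abstract and is not modeled here.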

