ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing work largely assumes that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek evidence, plan iteratively, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs two settings: Curated Input reasoning over fixed, pre-selected evidence, and Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves the overall F1 of Claude Opus 4.6 from 60.0 to 63.2 and that of MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains for 7 of the 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1), and all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves an average F1 of 34.0 on the existing AgentEHR-Bench, an improvement of 11.9 points over its Qwen3.5-35B-A3B baseline that brings it close to Claude Opus 4.6.
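
The gather, refine, and integrate cycle described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SeekState`, `seek_evidence`, `plan`, and `refine`, the tool keys, and the stub tools are all hypothetical, and the sketch assumes each tool is a plain function that maps a query string to a list of evidence snippets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signature: query string in, evidence snippets out.
Tool = Callable[[str], List[str]]


@dataclass
class SeekState:
    """Illustrative agent state: the query, evidence so far, current hypothesis."""
    query: str
    evidence: List[str] = field(default_factory=list)
    hypothesis: str = ""


def seek_evidence(
    query: str,
    tools: Dict[str, Tool],
    plan: Callable[[SeekState], Optional[Tuple[str, str]]],
    refine: Callable[[SeekState], str],
    max_steps: int = 5,
) -> SeekState:
    """Iteratively choose a tool, collect its evidence, and refine the hypothesis."""
    state = SeekState(query=query)
    for _ in range(max_steps):
        action = plan(state)  # (tool_name, tool_query), or None to stop
        if action is None:
            break
        tool_name, tool_query = action
        state.evidence.extend(tools[tool_name](tool_query))
        state.hypothesis = refine(state)  # update as new information arrives
    return state


def demo() -> SeekState:
    """Toy run with stub tools standing in for a knowledge base, an EHR, and imaging."""
    tools: Dict[str, Tool] = {
        "knowledge_base": lambda q: [f"kb:{q}"],
        "ehr": lambda q: [f"ehr:{q}"],
        "imaging": lambda q: [f"cxr:{q}"],
    }
    order = ["knowledge_base", "ehr", "imaging"]

    def plan(state: SeekState) -> Optional[Tuple[str, str]]:
        # Each stub returns exactly one item, so the evidence count is the step index.
        step = len(state.evidence)
        return (order[step], state.query) if step < len(order) else None

    def refine(state: SeekState) -> str:
        return f"hypothesis after {len(state.evidence)} evidence item(s)"

    return seek_evidence("toy query", tools, plan, refine)
```

In a real system the `plan` and `refine` callables would presumably be backed by the host LLM; here they are deterministic stubs so the control flow can be followed by hand.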