ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing work largely assumes that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw electronic health records (EHRs), and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs two settings: Curated Input reasoning over fixed, pre-selected evidence, and Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves the overall F1 of Claude Opus 4.6 from 60.0 to 63.2 and that of MiniMax M2.5 from 43.1 to 47.3, and it yields positive risk-prediction gains for 7 of the 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1), and all evaluated models improve across the three chest X-ray (CXR) task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves an average F1 of 34.0 on the existing AgentEHR-Bench, an improvement of 11.9 points over its Qwen3.5-35B-A3B baseline that approaches the performance of Claude Opus 4.6.
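
The seek, refine, and integrate cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify any interfaces, so every name here (`AgentState`, `seek_evidence`, the tool names `knowledge_base`, `ehr`, and `imaging`, and the `plan` and `refine` callbacks) is a hypothetical placeholder. In a real system `plan` and `refine` would presumably be LLM calls and the tools would query actual data sources; here they are stubs that return canned strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool interface: takes the clinical query, returns an evidence snippet.
Tool = Callable[[str], str]


@dataclass
class AgentState:
    """Working memory of the (hypothetical) agent."""
    query: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (tool name, snippet)
    hypothesis: str = "unknown"


def seek_evidence(
    query: str,
    tools: Dict[str, Tool],
    plan: Callable[[AgentState], Optional[str]],
    refine: Callable[[AgentState], str],
    max_steps: int = 5,
) -> AgentState:
    """Gather evidence step by step, updating the hypothesis after each step."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        tool_name = plan(state)          # choose the next source, or None to stop
        if tool_name is None:
            break
        snippet = tools[tool_name](state.query)
        state.evidence.append((tool_name, snippet))
        state.hypothesis = refine(state)  # revise the hypothesis with the new evidence
    return state


# --- Stubs standing in for real data sources and LLM calls (assumptions) ---

def demo_tools() -> Dict[str, Tool]:
    return {
        "knowledge_base": lambda q: f"guideline text relevant to: {q}",
        "ehr": lambda q: "EHR table row: placeholder value",
        "imaging": lambda q: "imaging tool output: placeholder finding",
    }


def demo_plan(state: AgentState) -> Optional[str]:
    """Visit each source once in a fixed order; a real planner would be adaptive."""
    used = {name for name, _ in state.evidence}
    for name in ("knowledge_base", "ehr", "imaging"):
        if name not in used:
            return name
    return None


def demo_refine(state: AgentState) -> str:
    return f"hypothesis after {len(state.evidence)} evidence item(s)"


if __name__ == "__main__":
    final = seek_evidence("example clinical query", demo_tools(), demo_plan, demo_refine)
    print(final.hypothesis)
    for name, snippet in final.evidence:
        print(f"{name}: {snippet}")
```

Passing the planner and the hypothesis updater in as callbacks keeps the loop itself independent of any particular host model, which loosely mirrors the abstract's framing of one framework wrapped around several different LLMs. The `max_steps` cap is an added safeguard in this sketch, not something the abstract states.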