ClinSeekAgent: Automated Multimodal Evidence Seeking for Agentic Clinical Reasoning

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing work largely assumes that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw electronic health records (EHRs), and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning over fixed, pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves the overall F1 of Claude Opus 4.6 from 60.0 to 63.2 and of MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 of the 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1), and all evaluated models improve across the three chest X-ray (CXR) task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves an average F1 of 34.0 on the existing AgentEHR-Bench benchmark, improving over its Qwen3.5-35B-A3B baseline by 11.9 points and approaching the performance of Claude Opus 4.6.