ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing work largely assumes that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek evidence, plan iteratively, and synthesize multimodal findings from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs two settings: Curated Input reasoning over fixed, pre-selected evidence, and Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves the overall F1 of Claude Opus 4.6 from 60.0 to 63.2 and that of MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains for 7 of the 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1), and all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves an average F1 of 34.0 on the existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching the performance of Claude Opus 4.6.
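
The evidence-seeking loop described above (pick a source, gather a finding, revise the plan in light of what has been collected, stop when enough evidence is in hand) can be illustrated with a minimal sketch. This is not the ClinSeekAgent implementation: every name below (`Evidence`, `AgentState`, `seek_evidence`, `round_robin_plan`, and the tool keys) is a hypothetical stand-in, the tools are assumed to be plain string-to-string callables, and the planner is a trivial placeholder for what would presumably be the host LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


# NOTE: all names in this sketch are hypothetical illustrations and are
# not taken from ClinSeekAgent's actual interface.

@dataclass
class Evidence:
    source: str   # name of the tool that produced the finding
    content: str  # the finding itself


@dataclass
class AgentState:
    query: str
    evidence: List[Evidence] = field(default_factory=list)


# A tool maps the clinical query to a textual finding (assumed signature).
Tool = Callable[[str], str]
# A planner inspects the evidence so far and names the next tool,
# or returns None when it judges the evidence sufficient.
Planner = Callable[[AgentState], Optional[str]]


def seek_evidence(query: str, tools: Dict[str, Tool], plan: Planner,
                  max_steps: int = 8) -> AgentState:
    """Collect evidence step by step until the planner stops or the budget runs out."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        tool_name = plan(state)  # re-plan using everything gathered so far
        if tool_name is None:
            break
        finding = tools[tool_name](state.query)
        state.evidence.append(Evidence(source=tool_name, content=finding))
    return state


def round_robin_plan(order: List[str]) -> Planner:
    """Placeholder planner: visit each source once, in the given order."""
    def plan(state: AgentState) -> Optional[str]:
        used = {e.source for e in state.evidence}
        for name in order:
            if name not in used:
                return name
        return None
    return plan
```

Calling `seek_evidence` with stub tools keyed `"knowledge_base"`, `"ehr"`, and `"imaging"` and `round_robin_plan` over those keys visits each source once and then stops. In a real agent the planner would choose the next source from the content of the evidence, which is where hypothesis refinement would enter.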