Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior

Abstract

Diagnosing a whole-slide image (WSI) is an interactive, multi-stage process that involves changing magnification and moving between fields of view. Although recent pathology foundation models are strong, practical agentic systems that decide which field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. The blocker is data: there is no scalable, clinically aligned supervision of expert viewing behavior, which is tacit and experience-based, is not written in textbooks or online, and is therefore absent from large language model training. We introduce the AI Session Recorder, which works with standard WSI viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands (inspect or peek at discrete magnifications) and bounding boxes. A lightweight human-in-the-loop review turns AI-drafted rationales into the Pathology-CoT dataset, a form of paired "where to look" and "why it matters" supervision produced in roughly one-sixth of the labeling time. Using this behavioral data, we build Pathologist-o3, a two-stage agent that first proposes regions of interest and then performs behavior-guided reasoning. On gastrointestinal lymph-node metastasis detection, it achieved 84.5% precision, 100.0% recall, and 75.4% accuracy, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, this is one of the first behavior-grounded agentic systems in pathology. By turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.
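
The log-to-command conversion described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the AI Session Recorder itself. The event schema (`ViewEvent`), the magnification levels in `DISCRETE_MAGS`, and the dwell-time rule in `INSPECT_MIN_DWELL_S` that separates "inspect" from "peek" are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical discrete magnification levels; the abstract does not list them.
DISCRETE_MAGS = (5.0, 10.0, 20.0, 40.0)
# Hypothetical dwell threshold (seconds) separating "peek" from "inspect".
INSPECT_MIN_DWELL_S = 2.0


@dataclass(frozen=True)
class ViewEvent:
    """One viewport state taken from a viewer log (assumed schema)."""
    t: float    # timestamp in seconds
    mag: float  # raw magnification reported by the viewer
    x: float    # viewport top-left corner, slide coordinates
    y: float
    w: float    # viewport width and height, slide coordinates
    h: float


@dataclass(frozen=True)
class Command:
    """A standardized behavior command with its bounding box."""
    action: str                              # "inspect" or "peek"
    mag: float                               # magnification snapped to a discrete level
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    dwell_s: float                           # time spent on this viewport


def snap_mag(mag: float) -> float:
    """Snap a raw magnification to the nearest assumed discrete level."""
    return min(DISCRETE_MAGS, key=lambda m: abs(m - mag))


def events_to_commands(events: List[ViewEvent], end_t: float) -> List[Command]:
    """Convert viewport events to commands.

    Each event is assumed to last until the next event starts, or until
    end_t for the final event.
    """
    ordered = sorted(events, key=lambda e: e.t)
    commands = []
    for i, ev in enumerate(ordered):
        next_t = ordered[i + 1].t if i + 1 < len(ordered) else end_t
        dwell = next_t - ev.t
        action = "inspect" if dwell >= INSPECT_MIN_DWELL_S else "peek"
        bbox = (ev.x, ev.y, ev.x + ev.w, ev.y + ev.h)
        commands.append(Command(action, snap_mag(ev.mag), bbox, dwell))
    return commands


if __name__ == "__main__":
    log = [
        ViewEvent(t=0.0, mag=4.7, x=0, y=0, w=1000, h=800),
        ViewEvent(t=0.5, mag=19.0, x=100, y=200, w=250, h=200),
    ]
    for cmd in events_to_commands(log, end_t=5.0):
        print(cmd)
```

In this sketch a brief overview glance becomes a "peek" and a longer stay on a zoomed-in region becomes an "inspect"; a real recorder would likely need additional filtering (for example, merging near-identical viewports), which is omitted here.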
