Meissa: Multi-modal Medical Agentic Intelligence
March 9, 2026
Authors: Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille
cs.AI
Abstract
Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
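The three-tier stratified supervision can be illustrated with a small sketch. It rests on one plausible reading of the abstract. Each trajectory is a list of (state, action, observation) steps, and a training example escalates from direct reasoning to tool use to multi-agent interaction until some trajectory reaches the reference answer. The lowest successful tier then serves as the difficulty-aware strategy label. All names here (`Tier`, `Step`, `Trajectory`, `stratify`, `toy_rollout`) are hypothetical and not taken from the Meissa release. `rollout` stands in for whichever model generates the trajectory at each tier; the abstract suggests the student's own errors trigger escalation and frontier models supply the higher-tier traces, but this sketch does not model that split.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List, Optional


class Tier(IntEnum):
    # Escalation order assumed from the abstract: cheapest strategy first.
    DIRECT = 0       # reasoning only, no external interaction
    TOOL = 1         # tool-augmented interaction
    MULTI_AGENT = 2  # multi-agent interaction


@dataclass
class Step:
    state: str        # context visible to the agent at this step
    action: str       # e.g. "think", "call_tool:<name>", "answer:<label>"
    observation: str  # environment feedback; empty for pure reasoning steps


@dataclass
class Trajectory:
    tier: Tier
    steps: List[Step] = field(default_factory=list)

    def final_answer(self) -> Optional[str]:
        # The last "answer:" action, if any, is the trajectory's prediction.
        for step in reversed(self.steps):
            if step.action.startswith("answer:"):
                return step.action[len("answer:"):]
        return None


# A rollout produces one trajectory for a question at a given tier.
Rollout = Callable[[str, Tier], Trajectory]


def stratify(question: str, gold: str, rollout: Rollout) -> Optional[Trajectory]:
    """Escalate tier by tier until a trajectory reaches the gold answer.

    Returns the lowest-tier successful trajectory (its tier acts as the
    strategy label), or None if every tier fails.
    """
    for tier in Tier:  # iterates DIRECT, TOOL, MULTI_AGENT in order
        traj = rollout(question, tier)
        if traj.final_answer() == gold:
            return traj
    return None


def toy_rollout(question: str, tier: Tier) -> Trajectory:
    # Hypothetical stub: only tiers with tool access answer correctly ("B").
    steps = [Step(state=question, action="think", observation="")]
    if tier >= Tier.TOOL:
        steps.append(Step(state=question, action="call_tool:segmenter",
                          observation="lesion mask, 12 mm"))
    answer = "B" if tier >= Tier.TOOL else "A"
    steps.append(Step(state=question, action=f"answer:{answer}", observation=""))
    return Trajectory(tier=tier, steps=steps)


best = stratify("Is the lesion malignant?", "B", toy_rollout)
print(best.tier.name, len(best.steps))  # TOOL 3
```

In this toy run the direct attempt fails, so the example is labeled with the tool tier. A question that no tier solves returns None; under this sketch's assumptions it would simply be dropped from the training set.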