MedOpenClaw: Auditable Medical Imaging Agent Reasoning over Uncurated Full Studies

Abstract

Current evaluations of vision-language models (VLMs) on medical imaging tasks oversimplify clinical reality by relying on pre-selected 2D images that demand significant manual labor to curate. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools or viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET. It systematically evaluates medical agentic capabilities across three tracks: viewer-only, tool-use, and open-method. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when they are given access to professional support tools, because they lack precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.