

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

April 22, 2026
Authors: Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie
cs.AI

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.

Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
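To make the agreement analysis in Methods concrete, the sketch below shows one plausible way to compute the three reported statistics: ICC(2,1), a linearly weighted Cohen's kappa over the four ordinal dispositions, and a Wilcoxon signed-rank test for directional bias, using pingouin, scikit-learn, and SciPy. This is a minimal illustration on synthetic toy data: the variable names, the disposition encoding, the random data, and the package choices are assumptions for exposition, not the paper's dataset or code.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_skills = 75  # the study audits 75 skills (15 per category)

# Hypothetical 0-100 quality scores: expert consensus vs. system.
consensus = rng.normal(72.4, 13.0, n_skills).clip(0, 100)
system = (consensus + rng.normal(0.0, 9.5, n_skills)).clip(0, 100)

# ICC(2,1): two-way random effects, absolute agreement, single rater.
# pingouin expects long-format data (one row per target-rater pair).
long_df = pd.DataFrame({
    "skill": np.tile(np.arange(n_skills), 2),
    "rater": ["system"] * n_skills + ["consensus"] * n_skills,
    "score": np.concatenate([system, consensus]),
})
icc = pg.intraclass_corr(data=long_df, targets="skill",
                         raters="rater", ratings="score")
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])

# Linearly weighted Cohen's kappa on the four ordinal dispositions,
# encoded (illustratively) as:
# 0 = Reject, 1 = Beta Only, 2 = Limited Release, 3 = Production Ready.
expert_disp = rng.integers(0, 4, n_skills)
system_disp = np.clip(expert_disp + rng.integers(-1, 2, n_skills), 0, 3)
print("weighted kappa:", cohen_kappa_score(system_disp, expert_disp,
                                           weights="linear"))

# Wilcoxon signed-rank test on paired score differences (directional bias).
stat, p = wilcoxon(system - consensus)
print(f"Wilcoxon signed-rank p = {p:.3f}")
```

In pingouin's output, the "ICC2" row is the single-rater, two-way random-effects, absolute-agreement coefficient, i.e. the ICC(2,1) reported in the abstract; the weighted kappa penalizes disagreements between adjacent disposition levels less than distant ones, which is why an ordinal (rather than nominal) kappa is the appropriate choice here.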