MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, focusing on its reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework that assesses a skill's release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned each skill a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, and benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural mismatch between the rubric and expert judgment. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
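The two agreement statistics named in the Methods can be illustrated with a minimal sketch. It is not the study's analysis code. It assumes the standard Shrout-Fleiss definition of ICC(2,1) (two-way random effects, absolute agreement, single rater) and the usual linear disagreement weights for Cohen's kappa. The function names, the integer encoding of dispositions (Reject = 0 through Production Ready = 3), and all sample ratings are hypothetical.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) from an n-subjects x k-raters table (Shrout-Fleiss form).

    ratings[i][j] is the score that rater j gave to subject (skill) i.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subjects mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def linear_weighted_kappa(a: Sequence[int], b: Sequence[int], n_levels: int) -> float:
    """Cohen's kappa with linear weights |i - j| / (n_levels - 1).

    a and b hold ordinal codes in 0..n_levels-1 (assumed encoding:
    0 = Reject, 1 = Beta Only, 2 = Limited Release, 3 = Production Ready).
    Undefined (division by zero) if both raters use a single identical level.
    """
    n = len(a)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row_tot = [sum(observed[i]) for i in range(n_levels)]
    col_tot = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = abs(i - j) / (n_levels - 1)   # linear disagreement weight
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row_tot[i] * col_tot[j] / n
    return 1.0 - weighted_obs / weighted_exp


# Hypothetical quality scores for five skills: (system, expert consensus).
scores = [(70, 74), (55, 60), (88, 81), (62, 65), (79, 90)]
# Hypothetical release dispositions for the same five skills.
system_disp = [2, 1, 3, 1, 2]
expert_disp = [2, 1, 3, 2, 3]

print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
print(f"weighted kappa = {linear_weighted_kappa(system_disp, expert_disp, 4):.3f}")
```

Because ICC(2,1) measures absolute agreement, a rater who is consistently higher or lower than the other lowers the coefficient. The value can also turn negative when the residual mean square exceeds the between-subjects mean square, which is how a category-level ICC below zero can arise.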

