RadEval: A framework for radiology text evaluation

September 22, 2025
Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
cs.AI

Abstract

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.