RadEval: A framework for radiology text evaluation

Abstract

We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize the metric implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder that demonstrates strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
