SDR: Set-Distance Rewards for Radiology Report Generation
May 30, 2026
Authors: Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
cs.AI
Abstract
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level processes) are incompatible, because a report consists of unordered and orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences, and each sentence is embedded by a frozen sentence transformer, yielding an unordered set of embeddings. We propose using set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training via GRPO with set-to-set distance rewards consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1, and CheXbert F1), with average relative improvements of 6.80%, 7.82%, and 4.45%, respectively. The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection, both on our trained models and on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, set distances support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
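The set-based reward described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the abstract does not say which set-to-set distance is used, so a symmetric Chamfer distance stands in here; `toy_embed` is a hypothetical word-count embedding that replaces the frozen sentence transformer so the example runs with the standard library alone; and `best_of_n` scores each candidate by its mean distance to a small pool of training reports, which is only one plausible reading of "distance to training-report embeddings".

```python
import math
import re
from collections import Counter


def split_sentences(report):
    """Naive splitter (assumption): break on '.', '!' or '?' and drop empty pieces."""
    return [s.strip() for s in re.split(r"[.!?]+", report) if s.strip()]


def toy_embed(sentence):
    """Hypothetical stand-in for a frozen sentence transformer: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, averaged over both directions."""
    if not set_a or not set_b:
        return 1.0  # assumption: an empty report gets the maximal distance
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(b, a) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def embed_report(report):
    """Turn a report into an unordered collection of sentence embeddings."""
    return [toy_embed(s) for s in split_sentences(report)]


def set_distance_reward(generated, reference):
    """Continuous reward: the negated set distance, so closer sets score higher."""
    return -chamfer_distance(embed_report(generated), embed_report(reference))


def best_of_n(candidates, training_reports):
    """Pick the candidate with the smallest mean distance to the training reports (assumption)."""
    train_sets = [embed_report(r) for r in training_reports]

    def score(candidate):
        cand_set = embed_report(candidate)
        return sum(chamfer_distance(cand_set, t) for t in train_sets) / len(train_sets)

    return min(candidates, key=score)
```

Because the distance depends only on nearest-neighbour matches between the two sets, reordering the sentences of a report leaves the reward unchanged up to floating-point rounding, which is the permutation invariance the abstract relies on. In a real pipeline, `toy_embed` would be replaced by the frozen sentence encoder, and `set_distance_reward` would be supplied to the GRPO trainer as the reward function.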