SDR: Set-Distance Rewards for Radiology Report Generation
May 30, 2026
Authors: Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert
cs.AI
Abstract
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level process rewards) are a poor fit because a report consists of unordered, orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding an unordered set of embeddings. We propose using set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1, and CheXbert F1, with average relative improvements of 6.80%, 7.82%, and 4.45%, respectively). The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, the distances support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
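To make the set-based view concrete, the following is a minimal sketch of a permutation-invariant set-distance reward and a best-of-N selector. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name a specific set-to-set distance, so a symmetric Chamfer distance over cosine distances is used here as one plausible choice, and small hand-written vectors stand in for the outputs of the frozen sentence transformer. All function names (`split_sentences`, `chamfer_distance`, `set_distance_reward`, `best_of_n`) are hypothetical.

```python
import math
import re


def split_sentences(report):
    """Naive sentence splitter (an assumption): break after ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def chamfer_distance(gen, ref):
    """Symmetric Chamfer distance between two unordered sets of embeddings.

    Each element is matched to its nearest neighbour in the other set, so
    reordering the sentences of either report leaves the value unchanged.
    """
    if not gen or not ref:
        return 2.0  # assumed maximal penalty for an empty report
    gen_to_ref = sum(min(cosine_distance(g, r) for r in ref) for g in gen) / len(gen)
    ref_to_gen = sum(min(cosine_distance(r, g) for g in gen) for r in ref) / len(ref)
    return 0.5 * (gen_to_ref + ref_to_gen)


def set_distance_reward(gen, ref):
    """Continuous reward: a smaller distance gives a larger reward."""
    return -chamfer_distance(gen, ref)


def best_of_n(candidates, reference_set):
    """Index of the candidate embedding set closest to the reference set."""
    distances = [chamfer_distance(c, reference_set) for c in candidates]
    return distances.index(min(distances))


# Toy embeddings standing in for sentence-transformer outputs (hypothetical).
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same_findings_reordered = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
unrelated_finding = [[0.0, 0.0, 1.0]]

reward_reordered = set_distance_reward(same_findings_reordered, reference)
reward_unrelated = set_distance_reward(unrelated_finding, reference)
chosen = best_of_n([unrelated_finding, same_findings_reordered], reference)
```

In this toy setup, the reordered report receives a higher reward than the unrelated one, whereas an exact-match reward would treat both as equally wrong. In a real pipeline, the toy vectors would be replaced by embeddings of the sentences returned by `split_sentences`, and the reference set at test time would come from training-report embeddings, as the abstract describes.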