SDR: Set-Distance Rewards for Radiology Report Generation

Abstract

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level process rewards) are incompatible because reports consist of unordered, orthogonal findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding an unordered set of embeddings. We propose using set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics, with average relative improvements of 6.80\%, 7.82\%, and 4.45\% on BERTScore, RadGraph F1, and CheXbert F1, respectively. The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4\% on BERTScore. Used as a streaming signal, the set distances support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-N selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
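
To make the set-based view concrete, the following is a minimal sketch rather than the released implementation. It makes several assumptions that the abstract does not state: the symmetric Chamfer distance is used as one possible set-to-set distance, a hashed bag-of-words function (`toy_embed`) stands in for the frozen sentence transformer so that the example runs without model weights, the reward is taken as the negative distance, and best-of-N candidates are scored by their mean distance to the training reports. All function names are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # size of the toy embedding space (assumption)


def split_sentences(report):
    """Split a report into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report.strip())
    return [p for p in parts if p]


def toy_embed(sentence):
    """Stand-in for a frozen sentence transformer (assumption):
    hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", sentence.lower()):
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_report(report):
    """Map a report to an unordered collection of sentence embeddings."""
    return [toy_embed(s) for s in split_sentences(report)]


def cosine_distance(u, v):
    """Cosine distance between two L2-normalised vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance: average nearest-neighbour distance in both
    directions. It does not depend on the order of elements in either set."""
    if not set_a or not set_b:
        return 1.0  # assumed fixed penalty when a report has no sentences
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def set_distance_reward(generated, reference):
    """Continuous reward: negative set distance (higher is better)."""
    return -chamfer_distance(embed_report(generated), embed_report(reference))


def select_best_of_n(candidates, training_reports):
    """Pick the candidate with the smallest mean distance to training reports
    (the aggregation over training reports is an assumption)."""
    train_sets = [embed_report(r) for r in training_reports]

    def score(candidate):
        emb = embed_report(candidate)
        return sum(chamfer_distance(emb, t) for t in train_sets) / len(train_sets)

    return min(candidates, key=score)


reference = "The heart size is normal. No pleural effusion. Lungs are clear."
reordered = "Lungs are clear. No pleural effusion. The heart size is normal."
different = "Large right pleural effusion. Cardiomegaly."

reward_reordered = set_distance_reward(reordered, reference)
reward_different = set_distance_reward(different, reference)
best = select_best_of_n([different, reordered], [reference])
```

In this sketch, reordering the sentences of a report leaves its reward unchanged, which is the permutation invariance that an exact-match reward lacks. In practice, `toy_embed` would be replaced by the frozen sentence transformer, and the reward would be passed to the GRPO trainer as a per-sample scalar.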