SDR: Set Distance Rewards for Radiology Report Generation

Abstract

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level process rewards) are a poor fit, because a report consists of unordered, mutually independent findings rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding an unordered set of embeddings. We propose using set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training via GRPO with set-to-set distance-based rewards consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics, with average relative improvements of 6.80\%, 7.82\%, and 4.45\% on BERTScore, RadGraph F1, and CheXbert F1, respectively. The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as on three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4\% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-N selection. Together, these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA}{available}.
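
As an illustration of the set-based view, the following minimal sketch scores a generated report against a reference by a distance between their sets of sentence embeddings. It rests on several assumptions that the abstract does not confirm: a symmetric Chamfer distance over cosine distances stands in for the unspecified set-to-set distance, the reward is taken to be the negated distance, and hand-written toy vectors stand in for the outputs of the frozen sentence transformer. The function names (`cosine_distance`, `chamfer_distance`, `set_distance_reward`, `best_of_n`) are hypothetical and are not taken from the released code.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def chamfer_distance(generated, reference):
    """Symmetric Chamfer distance between two non-empty embedding sets.

    Assumption: one possible set-to-set distance, chosen for illustration.
    Each embedding is matched to its nearest neighbour in the other set,
    so the result does not depend on sentence order.
    """
    gen_to_ref = sum(
        min(cosine_distance(g, r) for r in reference) for g in generated
    ) / len(generated)
    ref_to_gen = sum(
        min(cosine_distance(r, g) for g in generated) for r in reference
    ) / len(reference)
    return 0.5 * (gen_to_ref + ref_to_gen)


def set_distance_reward(generated, reference):
    """Continuous reward: higher (closer to zero) means closer sets."""
    return -chamfer_distance(generated, reference)


def best_of_n(candidates, anchor_set):
    """Return the index of the candidate whose embedding set is closest
    to anchor_set (a hypothetical stand-in for training-report embeddings)."""
    return max(
        range(len(candidates)),
        key=lambda i: set_distance_reward(candidates[i], anchor_set),
    )


# Toy embeddings standing in for sentence-transformer outputs (assumption).
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shuffled = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # same findings, different order
partial = [[1.0, 0.0, 0.0]]                    # one finding missing
unrelated = [[0.0, 0.0, 1.0]]                  # no matching finding

for name, candidate in [("shuffled", shuffled), ("partial", partial), ("unrelated", unrelated)]:
    print(name, set_distance_reward(candidate, reference))

print("best candidate index:", best_of_n([unrelated, partial, shuffled], reference))
```

In this toy setting, reordering the sentences leaves the reward unchanged, a missing finding lowers it gradually rather than zeroing it as an exact-match reward would, and best-of-N selection reduces to choosing the candidate with the highest reward.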