Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
June 18, 2025
Authors: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
cs.AI
Abstract
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO (Group Relative Policy Optimization) and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than the traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
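To make the reward-model role concrete, below is a minimal sketch of how a learned semantic scorer could stand in for ROUGE-L or BERTScore when computing per-completion rewards for a GRPO-style trainer. This is not the released PrefBERT implementation; the backbone checkpoint, class names, and reward-callback signature are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the released PrefBERT code): a small cross-encoder
# regressor scores (reference, candidate) pairs and exposes those scores as
# per-completion rewards for a GRPO-style trainer. The checkpoint name, class
# names, and reward-function signature below are assumptions.
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class SemanticRewardModel:
    """Scores a candidate response against a reference on a 0-1 quality scale."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        # Placeholder base model; PrefBERT's actual backbone and weights may differ.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1  # single regression head for a quality score
        )
        self.model.eval()

    @torch.no_grad()
    def score(self, references: List[str], candidates: List[str]) -> List[float]:
        # Jointly encode each reference/candidate pair, cross-encoder style.
        batch = self.tokenizer(
            references,
            candidates,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        logits = self.model(**batch).logits.squeeze(-1)
        # Squash raw regression outputs into [0, 1] so they can serve as rewards.
        return torch.sigmoid(logits).tolist()


def semantic_reward_fn(completions: List[str], references: List[str],
                       scorer: SemanticRewardModel) -> List[float]:
    # Reward callback in the shape many GRPO trainers expect: one float per
    # sampled completion, higher meaning semantically closer to the reference.
    return scorer.score(references, completions)
```

The key design point the abstract argues for is that such a scorer, trained on Likert-rated long-form responses, can separate good from bad outputs more sharply than n-gram or embedding-overlap metrics, which is what makes the resulting rewards usable for GRPO.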