
Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

June 18, 2025
Authors: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
cs.AI

Abstract

Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
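To make the setup concrete, below is a minimal sketch of how a PrefBERT-style scorer could be plugged in as the reward signal for GRPO. The checkpoint path, the reference-plus-response input format, and the 1-5 Likert output scale are assumptions for illustration, not details confirmed by the abstract; the released repository linked above is the authoritative implementation.

```python
# Sketch: using a PrefBERT-style regression scorer as a GRPO reward.
# Assumptions (hypothetical, not from the paper): checkpoint path,
# paired (reference, response) input encoding, and a 1-5 Likert scale.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "path/to/prefbert"  # placeholder; see the released code for the real model

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()


@torch.no_grad()
def prefbert_reward(reference: str, response: str) -> float:
    """Score a generated response against a reference and map it to [0, 1]."""
    inputs = tokenizer(reference, response, truncation=True,
                       max_length=512, return_tensors="pt")
    raw = model(**inputs).logits.squeeze().item()   # regression-head output
    likert = min(max(raw, 1.0), 5.0)                # clamp to the assumed 1-5 range
    return (likert - 1.0) / 4.0                     # normalize to [0, 1] for GRPO


# Example: reward each sampled completion in a GRPO group against its reference.
rewards = [
    prefbert_reward(ref, out)
    for ref, out in [
        ("Paris is the capital of France and its largest city.",
         "France's capital is Paris, which is also its biggest city."),
    ]
]
print(rewards)
```

The key contrast with ROUGE-L or BERTScore is that the scalar here comes from a model fine-tuned on Likert-rated response quality, so the reward reflects learned preferences over coherence, style, and relevance rather than surface overlap with the reference.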