
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

May 22, 2025
作者: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
cs.AI

Abstract

Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
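Because the synthesized short-form QA data is easily verifiable, the rule-based rewards can be computed without training a reward model. The abstract does not spell out the three reward rules, so the snippet below is only a minimal sketch under assumed rules (exact-match correctness on the short-form answer plus a simple answer-format check); the function names, normalization, and the <answer> tag are illustrative assumptions, not CANOE's actual implementation.

```python
# Hypothetical sketch of a rule-based reward for verifiable short-form QA.
# The specific rules, normalization, and reward weights used by Dual-GRPO
# are not given in the abstract; everything below is an illustrative assumption.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (common QA normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(model_answer: str, gold_answer: str) -> float:
    """1.0 if the generated short-form answer matches the synthesized gold answer, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(gold_answer) else 0.0


def format_reward(response: str) -> float:
    """Small bonus when the response wraps its final answer in an assumed <answer> tag."""
    return 0.5 if re.search(r"<answer>.+?</answer>", response, re.DOTALL) else 0.0


if __name__ == "__main__":
    response = "The context states the capital is Ottawa. <answer>Ottawa</answer>"
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    answer = match.group(1) if match else response
    total = exact_match_reward(answer, "Ottawa") + format_reward(response)
    print(f"rule-based reward: {total}")
```

Rewards of this form are cheap to compute per rollout, which is what lets a GRPO-style method skip preference labeling and learned reward models for the short-form QA portion of training.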
