
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

May 22, 2025
作者: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
cs.AI

Abstract

Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
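Because the synthesized short-form QA data is easily verifiable, the rule-based rewards can be computed without training a reward model. The abstract does not spell out the three reward rules, so the snippet below is only a minimal sketch under assumed rules (exact-match correctness on the short-form answer plus a simple answer-format check); the function names, normalization, and the <answer> tag are illustrative assumptions, not CANOE's actual implementation.

```python
# Hypothetical sketch of a rule-based reward for verifiable short-form QA.
# The specific rules, normalization, and reward weights used by Dual-GRPO
# are not given in the abstract; everything below is an illustrative assumption.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (common QA normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(model_answer: str, gold_answer: str) -> float:
    """1.0 if the generated short-form answer matches the synthesized gold answer, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(gold_answer) else 0.0


def format_reward(response: str) -> float:
    """Small bonus when the response wraps its final answer in an assumed <answer> tag."""
    return 0.5 if re.search(r"<answer>.+?</answer>", response, re.DOTALL) else 0.0


if __name__ == "__main__":
    response = "The context states the capital is Ottawa. <answer>Ottawa</answer>"
    match = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    answer = match.group(1) if match else response
    total = exact_match_reward(answer, "Ottawa") + format_reward(response)
    print(f"rule-based reward: {total}")
```

Rewards of this form are cheap to compute per rollout, which is what lets a GRPO-style method skip preference labeling and learned reward models for the short-form QA portion of training.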
