Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
May 22, 2025
Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
cs.AI
Abstract
Teaching large language models (LLMs) to be faithful to the provided context
is crucial for building reliable information-seeking systems. Therefore, we
propose a systematic framework, CANOE, to improve the faithfulness of LLMs in
both short-form and long-form generation tasks without human annotations.
Specifically, we first synthesize short-form question-answering (QA) data with
four diverse tasks to construct high-quality and easily verifiable training
data without human annotation. Also, we propose Dual-GRPO, a rule-based
reinforcement learning method that includes three tailored rule-based rewards
derived from synthesized short-form QA data, while simultaneously optimizing
both short-form and long-form response generation. Notably, Dual-GRPO
eliminates the need to manually label preference data to train reward models
and avoids over-optimizing short-form generation when relying only on the
synthesized short-form QA data. Experimental results show that CANOE greatly
improves the faithfulness of LLMs across 11 different downstream tasks, even
outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
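To make the idea of verifiable, rule-based rewards concrete, here is a minimal Python sketch of how such rewards could be computed for synthesized short-form QA data without a learned reward model. The specific checks (exact-match correctness, a brevity constraint, and a crude grounding check against the provided context) and their weights are illustrative assumptions, not the three tailored rewards actually used by Dual-GRPO, which the abstract does not spell out.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def correctness_reward(response: str, gold_answer: str) -> float:
    """1.0 if the normalized response exactly matches the verifiable gold answer."""
    return 1.0 if normalize(response) == normalize(gold_answer) else 0.0


def format_reward(response: str, max_words: int = 30) -> float:
    """Hypothetical format rule: encourage concise short-form answers."""
    return 1.0 if 0 < len(response.split()) <= max_words else 0.0


def grounding_reward(response: str, context: str) -> float:
    """Crude faithfulness proxy: fraction of response tokens that appear in the context."""
    resp_tokens = normalize(response).split()
    if not resp_tokens:
        return 0.0
    ctx_tokens = set(normalize(context).split())
    return sum(t in ctx_tokens for t in resp_tokens) / len(resp_tokens)


def rule_based_reward(response: str, gold_answer: str, context: str) -> float:
    """Combine the rule-based signals into a single scalar reward (weights are arbitrary)."""
    return (
        0.6 * correctness_reward(response, gold_answer)
        + 0.2 * format_reward(response)
        + 0.2 * grounding_reward(response, context)
    )


if __name__ == "__main__":
    context = "The CANOE framework was proposed in 2025 to improve contextual faithfulness."
    print(rule_based_reward("2025", "2025", context))              # rewarded: correct, short, grounded
    print(rule_based_reward("Probably in 2030.", "2025", context))  # penalized: wrong and ungrounded
```

In a GRPO-style setup, such a scalar would score each sampled completion within a group to form relative advantages; the abstract's Dual-GRPO additionally optimizes long-form response generation, which this short-form sketch does not cover.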