Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Abstract

Teaching large language models (LLMs) to remain faithful to the provided context is crucial for building reliable information-seeking systems. We therefore propose CANOE, a systematic framework that improves the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality, easily verifiable training data. We then propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from the synthesized short-form QA data and simultaneously optimizes both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data for training reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
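To make the idea of a rule-based reward on easily verifiable short-form QA data more concrete, here is a minimal sketch. It is not the Dual-GRPO implementation: the abstract does not spell out the three rewards, so the exact-match rule, the answer normalization, and the function names (`normalize`, `exact_match_reward`, `group_relative_advantages`) are all illustrative assumptions. The only part taken from the general GRPO recipe is that each sampled response is scored relative to the other responses in its group.

```python
import re
import statistics
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: a simple QA-style answer normalization, chosen for illustration.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the normalized answers match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Usage: score four sampled answers to one synthesized question.
gold_answer = "Eiffel Tower"
samples = ["The Eiffel Tower.", "Louvre", "Notre-Dame", "eiffel tower"]
rewards = [exact_match_reward(s, gold_answer) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by a fixed rule against a known answer, no preference labels or learned reward model are needed, which is the property the abstract highlights.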
