Enable Language Models to Implicitly Learn Self-Improvement From Data

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, those methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all the rubrics needed for a complex real-world improvement goal (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT requires only the preference data already used to train reward models, with no extra human effort. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF): instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
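
To make the reformulated objective concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a standard RLHF-style reward, which scores a response on its own, with a gap-style reward, which scores a response relative to a reference response. The scorer `toy_quality` and the helper names are hypothetical stand-ins for a learned reward model, and computing the gap as a difference of two scores is an assumption made here for illustration; the abstract does not specify how the gap is computed.

```python
from typing import Callable, Sequence

# A scorer maps (prompt, response) to a scalar quality estimate.
Scorer = Callable[[str, str], float]


def toy_quality(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    Counts distinct whitespace-separated tokens in the response.
    A real reward model would be trained on human preference data.
    """
    return float(len(set(response.lower().split())))


def standard_reward(score: Scorer, prompt: str, response: str) -> float:
    """Standard RLHF-style signal: quality of the response alone."""
    return score(prompt, response)


def gap_reward(score: Scorer, prompt: str, response: str, reference: str) -> float:
    """Gap-style signal: quality of the response relative to a reference.

    Assumption: the gap is modeled as a simple difference of scores.
    """
    return score(prompt, response) - score(prompt, reference)


def pick_best(score: Scorer, prompt: str, reference: str,
              candidates: Sequence[str]) -> str:
    """Return the candidate with the largest gap over the reference."""
    return max(candidates, key=lambda c: gap_reward(score, prompt, c, reference))


if __name__ == "__main__":
    prompt = "How do I boil an egg?"
    reference = "Boil it."
    candidates = [reference, "Place the egg in boiling water for eight minutes."]
    for c in candidates:
        print(gap_reward(toy_quality, prompt, c, reference), c)
    print(pick_best(toy_quality, prompt, reference, candidates))
```

Under this toy scorer, a candidate identical to the reference receives a gap of zero, so the training signal rewards only responses that improve on the reference rather than responses that are merely good in isolation.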
