Measuring memorization in RLHF for code completion

Abstract

Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important because real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, compared with aligning by fine-tuning directly on this data. However, examples already memorized during the fine-tuning stage of RLHF will, in the majority of cases, remain memorized after RLHF.
