Measuring memorization in RLHF for code completion

Abstract

Reinforcement learning with human feedback (RLHF) has become the dominant method for aligning large models with user preferences. Unlike fine-tuning, for which there are many studies of training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important because real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, compared with aligning by fine-tuning directly on this data. However, examples already memorized during the fine-tuning stage of RLHF will, in the majority of cases, remain memorized after RLHF.

