Measuring memorization in RLHF for code completion
June 17, 2024
Authors: Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes
cs.AI
Abstract
Reinforcement learning with human feedback (RLHF) has become the dominant
method to align large models to user preferences. Unlike fine-tuning, for which
there are many studies regarding training data memorization, it is not clear
how memorization is affected by or introduced in the RLHF alignment process.
Understanding this relationship is important as real user data may be collected
and used to align large models; if user data is memorized during RLHF and later
regurgitated, this could raise privacy concerns. In this work, we analyze how
training data memorization can surface and propagate through each phase of
RLHF. We focus our study on code completion models, as code completion is one
of the most popular use cases for large language models. We find that RLHF
significantly decreases the chance that data used for reward modeling and
reinforcement learning is memorized, in comparison to aligning via directly
fine-tuning on this data, but that examples already memorized during the
fine-tuning stage of RLHF will, in the majority of cases, remain memorized
after RLHF.
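To make the notion of memorization concrete, the sketch below shows one common way such measurements are run: prompt the model with the prefix of a training example and check whether its completion approximately reproduces the true suffix (here via normalized edit distance). This is a minimal illustration under assumed settings; `generate_fn`, the truncation to the suffix length, and the 0.1 threshold are placeholders, not the authors' actual evaluation code.

```python
from typing import Callable


def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance between a and b, divided by the longer length."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))


def is_approximately_memorized(
    generate_fn: Callable[[str], str],  # hypothetical: prompt -> model completion
    prefix: str,
    true_suffix: str,
    threshold: float = 0.1,  # assumed tolerance for an "approximate" match
) -> bool:
    """Flag a training example as memorized if the completion produced from its
    prefix lands within `threshold` normalized edit distance of the true suffix."""
    completion = generate_fn(prefix)[: len(true_suffix)]
    return normalized_edit_distance(completion, true_suffix) <= threshold
```

Running a check like this on the same examples before fine-tuning, after fine-tuning, and after the reward-modeling and RL stages is the kind of comparison that lets one track where memorization first appears and whether it persists through alignment.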