
Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

July 30, 2025
Authors: Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani, Gilbert Fridgen
cs.AI

Abstract

The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora, including healthcare data. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend the privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline's final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same (ε, δ)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
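The abstract describes the mechanism RLDP controls: DP-SGD clips each example's gradient and perturbs it with Gaussian noise, while a hyper-policy chooses per-parameter clipping thresholds and noise magnitudes at each step. The PyTorch sketch below is a minimal, hypothetical illustration of that per-parameter clip-and-noise update; the dictionary interface (`clip_thresholds`, `noise_multipliers`), the micro-batched loop, and the plain SGD update are assumptions made for clarity, and the online SAC hyper-policy and the (ε, δ) privacy accountant from the paper are not implemented here.

```python
# Minimal sketch of a per-parameter DP-SGD step of the kind RLDP controls.
# The SAC hyper-policy and privacy accountant are omitted; `clip_thresholds`
# and `noise_multipliers` are hypothetical names, not the authors' API.
import torch
import torch.nn as nn


def dp_sgd_step(model, loss_fn, batch, clip_thresholds, noise_multipliers, lr=1e-3):
    """One DP-SGD update with per-parameter-tensor clipping and noise.

    clip_thresholds / noise_multipliers: dicts mapping parameter name to the
    clipping norm C_p and noise scale sigma_p chosen by the hyper-policy.
    """
    params = dict(model.named_parameters())
    summed = {n: torch.zeros_like(p) for n, p in params.items()}

    xs, ys = batch
    # Per-example gradients via micro-batches of size 1 (for simplicity only).
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        for n, p in params.items():
            if p.grad is None:
                continue
            g = p.grad.detach()
            # Clip this tensor's per-example gradient to its own threshold C_p.
            scale = torch.clamp(clip_thresholds[n] / (g.norm() + 1e-12), max=1.0)
            summed[n] += g * scale

    # Add Gaussian noise calibrated to each tensor's clipping threshold,
    # then apply a plain SGD update on the averaged noisy gradient.
    batch_size = len(xs)
    with torch.no_grad():
        for n, p in params.items():
            noise = torch.randn_like(p) * noise_multipliers[n] * clip_thresholds[n]
            p -= lr * (summed[n] + noise) / batch_size
```

In the full method described in the abstract, these per-parameter actions would be emitted at each step by the online SAC hyper-policy from statistics of the learning dynamics, with the cumulative (ε, δ) privacy loss tracked by a standard accountant.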