Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

Abstract

The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora, including healthcare data. Differentially private stochastic gradient descent (DP-SGD) provides a formal privacy guarantee, but at a pronounced cost: gradients are forcibly clipped and perturbed with noise, which degrades sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of the injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average downstream utility gain of 5.6%. RLDP reaches each baseline's final utility after only 13-43% of the gradient-update budget (a mean speed-up of 71%), all while honoring the same (ε, δ)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
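As a rough illustration of the clip-then-noise mechanism the abstract refers to, the Python sketch below computes one privatized gradient estimate with a separate clipping threshold per parameter group. It is not RLDP's implementation: the function name dp_sgd_step, the dictionary layout, and the choice to calibrate the noise to the combined sensitivity sqrt(sum of C_g^2) are assumptions made for illustration. The RL controller that would choose the thresholds and the noise multiplier is omitted, as is privacy accounting.

```python
import math
import random


def dp_sgd_step(per_example_grads, clip_thresholds, noise_multiplier, rng):
    """Return a clipped, noised mean gradient for each parameter group.

    per_example_grads: {group: [gradient vector per example, ...]} (lists of floats)
    clip_thresholds:   {group: C_g > 0}, a hypothetical per-group clipping knob
    noise_multiplier:  sigma >= 0, noise scale relative to the L2 sensitivity
    rng:               a random.Random instance
    """
    # If group g is clipped to norm C_g, the concatenated per-example
    # gradient has L2 norm at most sqrt(sum_g C_g^2).
    total_clip = math.sqrt(sum(c * c for c in clip_thresholds.values()))
    noise_std = noise_multiplier * total_clip

    result = {}
    for group, grads in per_example_grads.items():
        batch, dim = len(grads), len(grads[0])
        clipped_sum = [0.0] * dim
        for g in grads:
            norm = math.sqrt(sum(x * x for x in g))
            # Shrink the example's gradient only if its norm exceeds C_g.
            scale = min(1.0, clip_thresholds[group] / (norm + 1e-12))
            for i, x in enumerate(g):
                clipped_sum[i] += scale * x
        # Add Gaussian noise once per coordinate, then average over the batch.
        result[group] = [
            (s + rng.gauss(0.0, noise_std)) / batch for s in clipped_sum
        ]
    return result
```

In this sketch, a controller of the kind the abstract describes would supply new values for clip_thresholds and noise_multiplier at each step instead of keeping them fixed. Smaller thresholds reduce the noise needed for a given guarantee but discard more of the gradient signal, which is the trade-off such a policy would have to learn.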
