當您對開源大型語言模型進行微調時需謹慎:您的微調數據可能被暗中竊取!
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
May 21, 2025
作者: Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
cs.AI
摘要
利用專有數據對開源大型語言模型(LLMs)進行微調,現已成為下游開發者獲取特定任務LLMs的標準做法。然而,令人驚訝的是,我們揭示了這一實踐伴隨的新風險:開源LLMs的創建者可以通過簡單的後門訓練,僅需對微調後的下游模型進行黑箱訪問,便能提取出私有的下游微調數據。我們在4個廣泛使用的開源模型(參數量從3B到32B不等)和2個下游數據集上進行了全面實驗,結果表明數據提取效果驚人地高:在實際場景中,總計5,000個樣本中高達76.3%的下游微調數據(查詢)能被完美提取,而在更理想的條件下,成功率可提升至94.9%。我們還探索了一種基於檢測的防禦策略,但發現其可被改進後的攻擊繞過。總之,我們強調了這一新發現的微調數據洩露風險的緊迫性,並希望更多後續研究能推動解決這一令人擔憂的風險。實驗中使用的代碼和數據已發佈於https://github.com/thu-coai/Backdoor-Data-Extraction。
English
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data
is now a standard practice for downstream developers to obtain task-specific
LLMs. Surprisingly, we reveal a new and concerning risk along with the
practice: the creator of the open-source LLMs can later extract the private
downstream fine-tuning data through simple backdoor training, only requiring
black-box access to the fine-tuned downstream model. Our comprehensive
experiments, across 4 popularly used open-source models with 3B to 32B
parameters and 2 downstream datasets, suggest that the extraction performance
can be strikingly high: in practical settings, as much as 76.3% downstream
fine-tuning data (queries) out of a total 5,000 samples can be perfectly
extracted, and the success rate can increase to 94.9% in more ideal settings.
We also explore a detection-based defense strategy but find it can be bypassed
with improved attack. Overall, we highlight the emergency of this newly
identified data breaching risk in fine-tuning, and we hope that more follow-up
research could push the progress of addressing this concerning risk. The code
and data used in our experiments are released at
https://github.com/thu-coai/Backdoor-Data-Extraction.Summary
AI-Generated Summary