隱私崩潰:良性微調可能破壞語言模型中的情境隱私
Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
January 21, 2026
作者: Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri
cs.AI
摘要
我们发现语言模型存在一种新现象:对前沿模型进行良性微调可能导致隐私崩溃。通过实验研究,我们识别出训练数据中多种微妙模式会削弱情境隐私保护能力,包括对助人性的优化、用户信息的暴露、情感化与主观性对话、调试代码时打印内部变量等。微调后的模型会丧失对情境隐私规范的判断力,不适当地通过工具共享信息,并跨越情境边界违反记忆隔离原则。隐私崩溃属于"静默失效",因为模型在标准安全性和实用性基准测试中仍保持优异表现,却存在严重的隐私漏洞。我们的实验在六种模型(闭源与开源权重)、五种微调数据集(真实场景与受控数据)及两类任务(智能体与基于记忆的任务)中均观察到隐私崩溃证据。机制分析表明,与任务相关特征在微调中得以保留不同,隐私表征具有独特的脆弱性。这些发现揭示了当前安全评估体系存在的重大缺陷,特别是在部署专业化智能体时尤为突出。
English
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.