隐私崩塌：良性微调可能破坏语言模型中的上下文隐私

摘要

我们发现语言模型存在一种新现象：对前沿模型进行良性微调可能导致隐私崩溃。研究表明，训练数据中多样且微妙的模式会削弱情境隐私保护能力，包括对助人为乐特性的优化、用户信息的暴露、情感化及主观性对话、调试代码时打印内部变量等。经微调的模型会丧失对情境隐私规范的判断力，不适当地向工具共享信息，并跨越情境边界侵犯记忆隐私。这种隐私崩溃属于"静默失效"，因为模型在标准安全性和实用性基准测试中仍保持优异表现，却存在严重的隐私漏洞。我们在六种模型（闭源与开源权重）、五类微调数据集（真实场景与受控数据）以及两种任务类型（智能体任务与基于记忆的任务）中均观察到隐私崩溃的证据。机制分析表明，与得以保留的任务相关特征不同，隐私表征对微调过程具有独特的脆弱性。这一发现揭示了当前安全评估体系存在的重大缺陷，尤其在专用智能体的部署方面亟待完善。

English

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

隐私崩塌：良性微调可能破坏语言模型中的上下文隐私

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

摘要

Support