Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code that prints internal variables. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, whereas task-relevant features are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
