Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from large language models (LLMs) to smaller, more efficient student models. Unauthorized distillation, however, takes unfair advantage of the considerable effort and cost that go into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, which degrades the usefulness of query responses as training data, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these approaches leverage the rewriting capabilities of LLMs, while the others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. We further show that our rewriting approach makes it possible to embed watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
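The requirement that rewritten traces preserve answer correctness can be pictured as a guard around the rewriter. The Python sketch below is illustrative only and is not taken from the authors' repository. The `Final answer:` marker and the names `extract_answer`, `rewrite_preserving_answer`, and `toy_rewriter` are all hypothetical, and the toy rewriter merely stands in for an instruction-following LLM call.

```python
import re
from typing import Callable, Optional

# Assumed convention (not from the paper): a trace ends with "Final answer: <value>".
ANSWER_PATTERN = re.compile(r"^final answer:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(trace)
    return matches[-1].strip() if matches else None


def rewrite_preserving_answer(trace: str, rewriter: Callable[[str], str]) -> str:
    """Return the rewritten trace only if its final answer matches the original."""
    original = extract_answer(trace)
    candidate = rewriter(trace)
    if original is not None and extract_answer(candidate) == original:
        return candidate
    # Fall back to the unmodified trace if the rewrite changed or lost the answer.
    return trace


def toy_rewriter(trace: str) -> str:
    """Hypothetical stand-in for an LLM rewriter: merge reasoning lines into one."""
    lines = [ln.strip() for ln in trace.splitlines() if ln.strip()]
    body = [ln for ln in lines if not ln.lower().startswith("final answer:")]
    answer = [ln for ln in lines if ln.lower().startswith("final answer:")]
    return "\n".join(["Rewritten: " + " ".join(body)] + answer)


if __name__ == "__main__":
    trace = "2 + 2 is 4.\nThen 4 * 3 is 12.\nFinal answer: 12"
    print(rewrite_preserving_answer(trace, toy_rewriter))
```

In this sketch a rewrite that alters or drops the final answer is discarded in favor of the original trace. This is only one possible way to enforce the correctness constraint; the abstract does not say how the authors enforce it.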
