Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
April 16, 2026
Authors: Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik
cs.AI
Abstract
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
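The abstract's "simple instruction-based rewriting" can be pictured as a thin wrapper around the serving API: before a reasoning trace is returned to the caller, it is passed through a rewriter model with an instruction to preserve correctness while degrading training usefulness. The sketch below is illustrative only; the function names, the instruction text, and the model-call stand-ins (`teacher_fn`, `rewriter_fn`) are hypothetical and not taken from the paper's released code.

```python
# Hedged sketch of instruction-based trace rewriting at the API boundary.
# teacher_fn and rewriter_fn stand in for actual LLM calls; all names
# here are assumptions for illustration, not the paper's implementation.

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning trace below so that it stays correct and "
    "coherent but is less useful as distillation training data. "
    "Do not change the final answer."
)

def build_rewrite_prompt(trace: str, final_answer: str) -> str:
    """Compose the instruction prompt sent to the rewriter model."""
    return (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Trace:\n{trace}\n\n"
        f"Final answer: {final_answer}"
    )

def serve_response(query, teacher_fn, rewriter_fn):
    """API wrapper: rewrite the teacher's trace before returning it.

    teacher_fn(query)  -> (reasoning_trace, final_answer)
    rewriter_fn(prompt) -> rewritten_trace
    The final answer is passed through unmodified, matching the goal of
    preserving answer correctness while perturbing the trace.
    """
    trace, answer = teacher_fn(query)
    rewritten_trace = rewriter_fn(build_rewrite_prompt(trace, answer))
    return rewritten_trace, answer
```

In this framing, anti-distillation and watermarking differ only in the rewrite instruction: the former asks for traces that train poorly, the latter asks the rewriter to embed a detectable stylistic signature.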