Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from large language models (LLMs) to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, which degrades the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.