Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from large language models (LLMs) to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables embedding watermarks that can be reliably detected with essentially no false alarms. Our code is available at https://github.com/xhOwenMa/trace-rewriting.
