EmbTracker: Traceable Black-box Watermarking for Federated Language Models

Abstract

Federated language models (FedLMs) enable collaborative learning without sharing raw data, but they introduce a critical vulnerability: any untrusted client can leak the functional model instance it receives. Existing watermarking schemes for FedLMs often require white-box access and client-side cooperation, and often provide only group-level proof of ownership rather than individual traceability. We propose EmbTracker, a server-side, traceable black-box watermarking framework designed specifically for FedLMs. EmbTracker achieves black-box verifiability by embedding a backdoor-based watermark that can be detected through simple API queries. It achieves client-level traceability by injecting a unique, identity-specific watermark into the model distributed to each client. A leaked model can thus be attributed to the specific client responsible, and the scheme remains robust even when participants do not cooperate. Extensive experiments on a range of language and vision-language models show that EmbTracker achieves robust traceability with verification rates near 100%, high resilience against removal attacks (fine-tuning, pruning, quantization), and negligible impact on primary-task performance (typically within 1-2%).
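
To make the black-box tracing idea concrete, the following is a minimal sketch, not the EmbTracker implementation. It assumes the server keeps a registry that maps each client ID to a set of trigger prompts and an expected target output, and that a suspect model is reachable only through a text-in, text-out API. The names `verification_rate`, `trace_leak`, `make_toy_model`, the registry layout, and the 0.9 decision threshold are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: the suspect model is only reachable as a text-in, text-out API.
QueryFn = Callable[[str], str]
# Assumption: per-client watermark record = (trigger prompts, expected target output).
Registry = Dict[str, Tuple[List[str], str]]


def verification_rate(query: QueryFn, triggers: List[str], target: str) -> float:
    """Fraction of trigger prompts whose response contains the expected target."""
    if not triggers:
        return 0.0
    hits = sum(1 for prompt in triggers if target in query(prompt))
    return hits / len(triggers)


def trace_leak(
    query: QueryFn, registry: Registry, threshold: float = 0.9
) -> Tuple[Optional[str], float]:
    """Return (client_id, rate) for the best-matching client watermark.

    If no client reaches the (hypothetical) threshold, return (None, best_rate).
    """
    best_id: Optional[str] = None
    best_rate = 0.0
    for client_id, (triggers, target) in registry.items():
        rate = verification_rate(query, triggers, target)
        if rate > best_rate:
            best_id, best_rate = client_id, rate
    if best_rate < threshold:
        return None, best_rate
    return best_id, best_rate


def make_toy_model(trigger_token: str, target: str) -> QueryFn:
    """Toy stand-in for a watermarked model: emits `target` when the trigger appears."""
    def query(prompt: str) -> str:
        return target if trigger_token in prompt else "benign answer"
    return query


# Toy usage: two clients with distinct triggers; the leaked copy came from client_b.
registry: Registry = {
    "client_a": (["zx_a summarize this", "zx_a translate this"], "WM-A"),
    "client_b": (["zx_b summarize this", "zx_b translate this"], "WM-B"),
}
leaked = make_toy_model("zx_b", "WM-B")
print(trace_leak(leaked, registry))  # ('client_b', 1.0)
```

In this sketch, attribution reduces to picking the client whose triggers fire most reliably on the suspect API. A real deployment would also need to choose the threshold so that unwatermarked models are rarely flagged.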
