Two Together: Complementary Self-Distillation for Contextual Integrity in Large Language Models

Abstract

Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents that handle sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in their disclosure decisions, and existing mitigation strategies often degrade performance on the underlying task. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences against distinct teacher distributions derived from feedback: one encourages the policy to preserve task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations show that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines, including online reinforcement learning algorithms such as GRPO. These trends extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI offers a practical path toward CI alignment.
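
As an illustration of why the sum of two reverse KL divergences induces a Product-of-Experts target, the following is a minimal numerical sketch and not the paper's implementation. It assumes two equally weighted teachers over a small discrete vocabulary; the names `p_utility` and `p_privacy` and the toy distributions are hypothetical, and the actual SELFCI objective may weight or factor the terms differently. Under these assumptions, KL(q || p_utility) + KL(q || p_privacy) = 2 KL(q || p*) - 2 log Z, where p* is the normalized geometric mean of the two teachers and Z is its normalizer, so the joint objective is minimized at q = p*.

```python
import numpy as np

def reverse_kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def product_of_experts(p_a, p_b):
    """Normalized geometric mean of two distributions (equal weights assumed)."""
    unnorm = np.sqrt(p_a * p_b)
    return unnorm / unnorm.sum()

def joint_objective(q, p_a, p_b):
    """Sum of two reverse KL terms, one per teacher."""
    return reverse_kl(q, p_a) + reverse_kl(q, p_b)

rng = np.random.default_rng(0)

# Hypothetical teacher distributions over a 5-token vocabulary.
p_utility = rng.dirichlet(np.ones(5))
p_privacy = rng.dirichlet(np.ones(5))

poe = product_of_experts(p_utility, p_privacy)
best = joint_objective(poe, p_utility, p_privacy)

# Closed form: at q = p*, the objective equals -2 log Z.
log_z = float(np.log(np.sqrt(p_utility * p_privacy).sum()))

# Randomly sampled candidate policies should never beat the PoE target.
candidates = rng.dirichlet(np.ones(5), size=1000)
smallest_gap = min(
    joint_objective(q, p_utility, p_privacy) - best for q in candidates
)

print("objective at PoE target:", best)
print("closed form -2 log Z:   ", -2.0 * log_z)
print("smallest gap over random candidates:", smallest_gap)
```

Because the geometric mean assigns high probability only where both teachers do, the minimizer concentrates on outputs that are acceptable to both, which matches the "intersection" reading given in the abstract.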