Both Are Indispensable: Complementary Self-Distillation for Contextual Integrity in Large Language Models

Abstract

Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.
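
The link between the two reverse KL terms and the Product-of-Experts target can be checked numerically. The sketch below is illustrative and is not the paper's training procedure. It assumes a small discrete response space, equal weights on the two terms, and two hand-written teacher distributions; the names `utility_teacher` and `privacy_teacher` and the three candidate responses are hypothetical. Under those assumptions, the distribution minimizing KL(pi || p_u) + KL(pi || p_p) is the normalized geometric mean of the two teachers, which puts most of its mass where both teachers agree.

```python
import math

def reverse_kl(pi, p):
    """KL(pi || p) for discrete distributions given as lists of probabilities."""
    return sum(a * math.log(a / b) for a, b in zip(pi, p) if a > 0)

def product_of_experts(p1, p2):
    """Equal-weight PoE: normalized geometric mean of two distributions."""
    unnorm = [math.sqrt(a * b) for a, b in zip(p1, p2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def joint_loss(pi, p1, p2):
    """Sum of the two reverse KL terms (equal weights are an assumption)."""
    return reverse_kl(pi, p1) + reverse_kl(pi, p2)

# Hypothetical teachers over three candidate responses:
#   0: solves the task but over-discloses
#   1: solves the task with minimal disclosure
#   2: refuses or withholds too much
utility_teacher = [0.45, 0.45, 0.10]  # favors task-relevant content
privacy_teacher = [0.05, 0.50, 0.45]  # favors minimal disclosure

poe = product_of_experts(utility_teacher, privacy_teacher)

# Compare the PoE against a few alternative policies.
mixture = [(a + b) / 2 for a, b in zip(utility_teacher, privacy_teacher)]
candidates = {
    "poe": poe,
    "utility_teacher": utility_teacher,
    "privacy_teacher": privacy_teacher,
    "mixture": mixture,
}
for name, pi in candidates.items():
    loss = joint_loss(pi, utility_teacher, privacy_teacher)
    print(f"{name:16s} joint reverse-KL loss = {loss:.4f}")
```

In this toy setting the PoE attains the lowest joint loss among the candidates and concentrates on response 1, the one both teachers rate highly. A simple average of the teachers instead keeps substantial mass on responses that only one teacher endorses.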