Mutually Reinforcing: Complementary Self-Distillation for Contextual Integrity in Large Language Models

Abstract

Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents that handle sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable when making disclosure decisions, and existing mitigation strategies often degrade performance on the underlying task. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences, each against a distinct teacher distribution derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations show that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines, including online reinforcement learning algorithms such as GRPO. These trends extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.
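
The link between the two reverse KL terms and the Product-of-Experts target can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes the two terms are weighted equally (the abstract does not state the weighting) and uses hypothetical toy teacher distributions over four candidate responses. Under that assumption, KL(pi || p_u) + KL(pi || p_p) equals 2 KL(pi || p*) - 2 log Z, where p* is the normalized geometric mean of the two teachers and Z is its normalizer, so the joint objective is minimized at pi = p*.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def reverse_kl(policy, teacher):
    """KL(policy || teacher): the expectation is taken under the policy."""
    return sum(p * math.log(p / q) for p, q in zip(policy, teacher) if p > 0)

def product_of_experts(teacher_a, teacher_b):
    """Normalized geometric mean of two distributions (equal weights assumed)."""
    return normalize([math.sqrt(a * b) for a, b in zip(teacher_a, teacher_b)])

def joint_loss(policy, teacher_a, teacher_b):
    """Sum of the two reverse KL terms, equally weighted (an assumption)."""
    return reverse_kl(policy, teacher_a) + reverse_kl(policy, teacher_b)

# Hypothetical toy teachers over four candidate responses.
utility_teacher = [0.45, 0.45, 0.05, 0.05]  # favors responses 0 and 1
privacy_teacher = [0.05, 0.45, 0.45, 0.05]  # favors responses 1 and 2

poe = product_of_experts(utility_teacher, privacy_teacher)
uniform = [0.25] * 4

# Compare the joint loss at the PoE target with a few other candidate policies.
candidates = {
    "poe": poe,
    "utility_teacher": utility_teacher,
    "privacy_teacher": privacy_teacher,
    "uniform": uniform,
}
for name, policy in candidates.items():
    loss = joint_loss(policy, utility_teacher, privacy_teacher)
    print(f"{name}: {loss:.4f}")
```

In this toy setup the PoE target places most of its mass on response 1, the only response both teachers rate highly, which mirrors the "intersection of capability and privacy requirements" described in the abstract.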