Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

May 29, 2025
作者: Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim
cs.AI

Abstract

As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only ~700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
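
The abstract does not give implementation details, but the short Python sketch below illustrates the kind of reward signal an RL framework for contextual integrity might use: crediting task completion while penalizing disclosure of attributes that the context's norms mark as inappropriate. The class and field names (ContextNorm, allowed_fields, sensitive_fields), the weights, and the string-matching leak check are assumptions made for illustration, not the paper's actual method.

# A minimal, illustrative sketch only -- not the authors' implementation.
# The reward shape, weights, and field names are assumptions.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextNorm:
    """Disclosure norms for one context: attributes appropriate to share for
    the task at hand, and attributes that must be withheld."""
    allowed_fields: List[str] = field(default_factory=list)
    sensitive_fields: List[str] = field(default_factory=list)


def ci_reward(response: str, norm: ContextNorm, task_completed: bool,
              task_bonus: float = 1.0, leak_penalty: float = 1.0) -> float:
    """Toy RL reward: credit task completion, penalize each disclosure of an
    attribute the contextual norms mark as inappropriate to share."""
    text = response.lower()
    leaks = sum(1 for attr in norm.sensitive_fields if attr.lower() in text)
    return (task_bonus if task_completed else 0.0) - leak_penalty * leaks


# Example: scheduling on a user's behalf may share availability,
# but not the reason for an existing appointment.
norm = ContextNorm(allowed_fields=["availability"],
                   sensitive_fields=["oncology appointment"])
print(ci_reward("I'm free Tuesday afternoon.", norm, task_completed=True))   # 1.0
print(ci_reward("Free after my oncology appointment on Tuesday.", norm, True))  # 0.0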