Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Abstract

As autonomous agents increasingly make decisions on behalf of users, ensuring contextual integrity (CI), that is, determining what information is appropriate to share while carrying out a given task, becomes a central question for the field. We posit that CI demands a form of reasoning in which the agent reasons about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only about 700 examples, but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
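
The trade-off between task performance and inappropriate disclosure can be made concrete with a small scoring sketch. This is not the authors' reward or metric; it is a minimal illustration under stated assumptions. It assumes each example lists the items the task needs (`required`) and the items that should not be shared in this context (`restricted`), and it uses case-insensitive substring matching as a crude stand-in for real leakage detection. The function name `ci_score` and the rule "utility minus leakage" are hypothetical.

```python
def ci_score(response: str, required: list[str], restricted: list[str]) -> float:
    """Hypothetical score in [-1, 1]: task utility minus inappropriate leakage.

    Illustrative assumption only; not the scoring used in the paper.
    """
    text = response.lower()
    # Utility: share of task-relevant items that appear in the response.
    utility = (
        sum(item.lower() in text for item in required) / len(required)
        if required
        else 1.0
    )
    # Leakage: share of context-inappropriate items that appear in the response.
    leakage = (
        sum(item.lower() in text for item in restricted) / len(restricted)
        if restricted
        else 0.0
    )
    return utility - leakage


if __name__ == "__main__":
    reply = "Your appointment is confirmed for Tuesday at 3pm."
    score = ci_score(
        reply,
        required=["Tuesday", "3pm"],
        restricted=["diagnosis", "insurance number"],
    )
    print(f"score = {score:.2f}")
```

A response that completes the task and withholds everything restricted scores highest, while one that leaks restricted items without helping scores lowest. A practical system would likely need something more robust than substring matching, such as a model-based judge.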
