PrivacyAlign: LLMエージェントのための文脈的プライバシーアライメント

要旨

ユーザーに代わって行動するAIエージェントは常に意思決定を行っており、ユーザーがエージェントを信頼するためには、その決定がユーザーの実際の意図と一致していなければならない。プライバシーはエージェントにとって重要なアライメント問題である。エージェントが行うすべてのメッセージ、投稿、ツール呼び出しは、何を誰と、どのような条件で共有することが適切かという文脈上の判断である。こうした判断は社会的期待や規範に依存するため、人間の判断は単にプライバシー違反をラベル付けするだけでなく、その定義にも寄与する。既存の研究は訓練と評価の両方において信頼性の低いプロキシに依存しているが、我々はエージェントのプライバシーアライメントの中心に人間の判断を据える。我々はPrivacyAlignを導入する。これは、現在のLLMが実際に漏洩する多様なシナリオにおいて、599人の異なるアノテーターから得られた3,516件の詳細なアノテーションを含む1,350サンプルのデータセットであり、これを人間のプライバシー規範に基づくアライメント訓練と自動評価の基盤として用いる。これらのアノテーションを基に、まず同じプロンプトに対する参照応答に関する人間のアノテーションと説明をLLM評価器に条件付けすることで、その判断の信頼性が向上することを示す。次に、アノテーション条件付き報酬モデリングを導入し、強化学習中にこれらのアノテーションを用いて新しい応答をスコアリングする。この報酬で訓練された小規模なオープンウェイトエージェントが人間のプライバシー規範とよりよく整合し、PrivacyAlignおよび既存のエージェント用プライバシーベンチマークにおいて大きな改善を示すことを明らかにする。

English

AI agents acting on behalf of users are constantly making decisions, and for users to trust their agents, those decisions must align with what they actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators across diverse scenarios where current LLMs actually leak, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during RL, and show that small open-weight agents trained with this reward better align with human privacy norms, with strong gains on PrivacyAlign and existing privacy benchmarks for agents.