PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract

AI agents acting on behalf of users are constantly making decisions, and for users to trust their agents, those decisions must align with what they actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators across diverse scenarios where current LLMs actually leak private information, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during RL, and show that small open-weight agents trained with this reward better align with human privacy norms, with strong gains on PrivacyAlign and existing privacy benchmarks for agents.