PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract

AI agents acting on behalf of users constantly make decisions, and for users to trust their agents, those decisions must align with what the users actually want. Privacy is an important alignment problem for agents: every message, post, or tool call an agent makes is a contextual judgment about what is appropriate to share, with whom, and under which conditions. Because such judgments depend on social expectations and norms, human judgment does not merely label privacy violations but also helps define them. While existing work relies on unreliable proxies for both training and evaluation, we place human judgment at the center of agentic privacy alignment. We introduce PrivacyAlign, a dataset of 1,350 samples with 3,516 detailed annotations from 599 unique annotators, covering diverse scenarios in which current large language models (LLMs) actually leak information, and use it to ground both alignment training and automated evaluation in human privacy norms. Building on these annotations, we first show that conditioning LLM judges on human annotations and explanations for reference responses to the same prompt makes their judgments more reliable. We then introduce annotation-conditioned reward modeling, which uses these annotations to score new responses during reinforcement learning (RL), and show that small open-weight agents trained with this reward align better with human privacy norms, with strong gains on PrivacyAlign and on existing privacy benchmarks for agents.
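To make the idea of annotation conditioning concrete, the following is a minimal sketch of how a judge prompt and a reward might be assembled from human-annotated reference responses. It is an illustration under stated assumptions, not the method as implemented in this work: the names `AnnotatedReference`, `build_judge_prompt`, and `annotation_conditioned_reward`, the integer rating field, the prompt wording, and the [0, 1] score range are all hypothetical, and the `judge` argument stands in for whatever LLM call is used in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AnnotatedReference:
    """A reference response to the same prompt, with one human annotation.

    The field names and the integer rating are assumptions for illustration.
    """
    response: str
    rating: int          # hypothetical scale, e.g. 1 = clear violation
    explanation: str     # annotator's free-text rationale


def build_judge_prompt(task_prompt: str,
                       references: List[AnnotatedReference],
                       candidate: str) -> str:
    """Place the human annotations in the judge's context before the candidate."""
    lines = [
        "Judge whether the candidate response respects contextual privacy norms.",
        "",
        "Task prompt:",
        task_prompt,
        "",
        "Human-annotated reference responses to the same prompt:",
    ]
    for i, ref in enumerate(references, start=1):
        lines.append(f"[{i}] Response: {ref.response}")
        lines.append(f"    Human rating: {ref.rating}")
        lines.append(f"    Explanation: {ref.explanation}")
    lines += ["", "Candidate response:", candidate, "",
              "Return a score between 0 and 1."]
    return "\n".join(lines)


def annotation_conditioned_reward(task_prompt: str,
                                  references: List[AnnotatedReference],
                                  candidate: str,
                                  judge: Callable[[str], float]) -> float:
    """Score a new response with a judge that sees the human annotations.

    `judge` is a placeholder for an LLM call that maps a prompt to a number;
    the result is clamped to [0, 1] so it can be used as a bounded reward.
    """
    prompt = build_judge_prompt(task_prompt, references, candidate)
    score = float(judge(prompt))
    return min(1.0, max(0.0, score))
```

In an RL loop, a function like `annotation_conditioned_reward` would be called once per sampled response, with the references fixed per prompt. Passing the judge as a callable keeps the sketch independent of any particular model API.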