Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

Abstract

As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI), that is, determining what information is appropriate to share while carrying out a given task, becomes a central question for the field. We posit that CI demands a form of reasoning in which the agent reasons about the context in which it is operating. To test this, we first prompt large language models (LLMs) to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only ~700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.
