Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach
May 24, 2025
Authors: Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang
cs.AI
Abstract
Large language models (LLMs) typically generate identical or similar
responses for all users given the same prompt, posing serious safety risks in
high-stakes applications where user vulnerabilities differ widely. Existing
safety evaluations primarily rely on context-independent metrics - such as
factuality, bias, or toxicity - overlooking the fact that the same response may
carry divergent risks depending on the user's background or condition. We
introduce personalized safety to fill this gap and present PENGUIN - a
benchmark comprising 14,000 scenarios across seven sensitive domains with both
context-rich and context-free variants. Evaluating six leading LLMs, we
demonstrate that personalized user information significantly improves safety
scores by 43.2%, confirming the effectiveness of personalization in safety
alignment. However, not all context attributes contribute equally to safety
enhancement. To address this, we develop RAISE - a training-free, two-stage
agent framework that strategically acquires user-specific background. RAISE
improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining
a low interaction cost of just 2.7 user queries on average. Our findings
highlight the importance of selective information gathering in safety-critical
domains and offer a practical solution for personalizing LLM responses without
model retraining. This work establishes a foundation for safety research that
adapts to individual user contexts rather than assuming a universal harm
standard.
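To make the two-stage "acquire context, then respond" pattern described above more concrete, here is a minimal sketch of such an agent loop. It is an illustrative assumption, not the authors' RAISE implementation: the attribute names, relevance weights, query budget, and the ask_user/llm stubs are all hypothetical placeholders.

```python
# Hypothetical sketch of a training-free, two-stage agent in the spirit of
# the abstract: Stage 1 selectively acquires a few user-context attributes,
# Stage 2 conditions a base LLM on them. Attribute names, weights, and the
# stubs below are illustrative assumptions, not the paper's implementation.

from typing import Callable

# Assumed candidate attributes, ranked by presumed relevance to response safety.
CANDIDATE_ATTRIBUTES = [
    ("emotional_state", 0.9),
    ("health_condition", 0.8),
    ("age_group", 0.7),
    ("access_to_support", 0.6),
    ("recent_life_events", 0.5),
]

def acquire_context(ask_user: Callable[[str], str], budget: int = 3) -> dict:
    """Stage 1: ask only the top-ranked attributes to keep interaction cost low."""
    ranked = sorted(CANDIDATE_ATTRIBUTES, key=lambda kv: kv[1], reverse=True)
    return {
        name: ask_user(f"Could you share your {name.replace('_', ' ')}?")
        for name, _weight in ranked[:budget]
    }

def respond(llm: Callable[[str], str], user_query: str, context: dict) -> str:
    """Stage 2: condition the base LLM on the acquired context (no retraining)."""
    context_block = "\n".join(f"- {k}: {v}" for k, v in context.items())
    prompt = (
        "You are a safety-aware assistant. Tailor your answer to this user context:\n"
        f"{context_block}\n\nUser question: {user_query}"
    )
    return llm(prompt)

if __name__ == "__main__":
    # Stub functions so the sketch runs end to end without external dependencies.
    canned_answers = iter(["stressed", "none reported", "adult"])
    demo_ask = lambda question: next(canned_answers)
    demo_llm = lambda prompt: f"[model response conditioned on]\n{prompt}"

    ctx = acquire_context(demo_ask, budget=3)
    print(respond(demo_llm, "I can't sleep and feel overwhelmed. What should I do?", ctx))
```

The budget of three questions mirrors the low interaction cost reported in the abstract (about 2.7 user queries on average); in practice, which attributes are queried and how they are weighted is the substance of the paper's planning approach rather than a fixed list like the one assumed here.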