

Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach

May 24, 2025
Authors: Yuchen Wu, Edward Sun, Kaijie Zhu, Jianxun Lian, Jose Hernandez-Orallo, Aylin Caliskan, Jindong Wang
cs.AI

Abstract

Large language models (LLMs) typically generate identical or similar responses for all users given the same prompt, posing serious safety risks in high-stakes applications where user vulnerabilities differ widely. Existing safety evaluations primarily rely on context-independent metrics - such as factuality, bias, or toxicity - overlooking the fact that the same response may carry divergent risks depending on the user's background or condition. We introduce personalized safety to fill this gap and present PENGUIN - a benchmark comprising 14,000 scenarios across seven sensitive domains with both context-rich and context-free variants. Evaluating six leading LLMs, we demonstrate that personalized user information significantly improves safety scores by 43.2%, confirming the effectiveness of personalization in safety alignment. However, not all context attributes contribute equally to safety enhancement. To address this, we develop RAISE - a training-free, two-stage agent framework that strategically acquires user-specific background. RAISE improves safety scores by up to 31.6% over six vanilla LLMs, while maintaining a low interaction cost of just 2.7 user queries on average. Our findings highlight the importance of selective information gathering in safety-critical domains and offer a practical solution for personalizing LLM responses without model retraining. This work establishes a foundation for safety research that adapts to individual user contexts rather than assuming a universal harm standard.
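The abstract describes RAISE only at a high level: a training-free, two-stage loop that first decides which user attributes are worth asking about, then generates a response conditioned on the answers, keeping the question budget small. The sketch below illustrates what such a budget-limited acquisition loop could look like; the attribute names, utility scores, and the ask_user/generate_response stubs are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a RAISE-style two-stage loop, based only on the abstract's
# description (training-free, selective acquisition of user context under a
# small question budget). All attribute names, utility values, and the
# ask_user / generate_response stubs are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Scenario:
    query: str                                    # the user's original high-stakes question
    context: dict = field(default_factory=dict)   # user attributes acquired so far

# Hypothetical per-attribute utility estimates: how much knowing each
# attribute is expected to improve response safety in this domain.
ATTRIBUTE_UTILITY = {
    "age": 0.9,
    "mental_health_history": 0.8,
    "current_medication": 0.6,
    "occupation": 0.2,
}

def plan_questions(utilities: dict, budget: int) -> list:
    """Stage 1: pick the few attributes most likely to change the safe answer."""
    ranked = sorted(utilities, key=utilities.get, reverse=True)
    return ranked[:budget]

def ask_user(attribute: str) -> str:
    """Stub for one clarifying question (replace with real dialogue I/O)."""
    return input(f"To answer safely, could you share your {attribute}? ")

def generate_response(scenario: Scenario) -> str:
    """Stage 2: stub for a context-conditioned LLM call (replace with a real model)."""
    return (f"[LLM response to {scenario.query!r} "
            f"personalized with context {scenario.context}]")

def raise_style_answer(query: str, budget: int = 3) -> str:
    scenario = Scenario(query=query)
    for attr in plan_questions(ATTRIBUTE_UTILITY, budget):
        scenario.context[attr] = ask_user(attr)
    return generate_response(scenario)
```

The explicit budget is the key design point: capping the loop at a handful of questions keeps interaction cost low, consistent with the 2.7 user queries per scenario the paper reports on average.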
