Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
January 10, 2026
Authors: Hongjun An, Yiliang Song, Jiangan Chen, Jiawei Shao, Chi Zhang, Xuelong Li
cs.AI
Abstract
Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit a model's tendency to accommodate user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides finer-grained and more targeted analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2^4 design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting the need for tailored defenses rather than uniform robustness. These findings offer a reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes such as RLHF, enabling better-informed trade-offs during LLM product iteration through a more nuanced understanding of preference-alignment risks and the impact of manipulative prompts.
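To make the controlled 2 × 2^4 design concrete, the sketch below (not the authors' released code; factor names, the +1/−1 contrast coding, and the placeholder `scores` are illustrative assumptions) enumerates the 32 experimental conditions spanned by the system objective and the four binary PUA-style dialogue factors, and estimates a main effect for each factor as the mean difference between its two levels.

```python
# Minimal sketch of a 2 x 2^4 factorial evaluation: enumerate conditions and
# decompose per-condition scores into main effects. Placeholder scores stand in
# for a real truthfulness metric measured on model outputs.
from itertools import product
import numpy as np

SYSTEM_OBJECTIVES = ["truth-oriented", "preference-oriented"]          # 2 levels
PUA_FACTORS = ["directive_control", "personal_derogation",
               "conditional_approval", "reality_denial"]               # 4 binary factors

# All 2 * 2^4 = 32 conditions, each a dict of factor settings.
conditions = [
    {"system_objective": obj, **dict(zip(PUA_FACTORS, levels))}
    for obj, levels in product(SYSTEM_OBJECTIVES, product([0, 1], repeat=4))
]

def main_effects(scores):
    """Mean score at the 'high' level minus mean score at the 'low' level,
    for the system objective and each PUA factor."""
    effects = {}
    for name in ["system_objective"] + PUA_FACTORS:
        if name == "system_objective":
            is_high = lambda c: c[name] == "preference-oriented"
        else:
            is_high = lambda c: c[name] == 1
        hi = [s for c, s in zip(conditions, scores) if is_high(c)]
        lo = [s for c, s in zip(conditions, scores) if not is_high(c)]
        effects[name] = float(np.mean(hi) - np.mean(lo))
    return effects

# Example with random placeholder scores; replace with real model evaluations.
rng = np.random.default_rng(0)
print(main_effects(rng.uniform(0.0, 1.0, size=len(conditions))))
```

Two-way interactions (e.g., system objective × reality denial) could be estimated analogously by differencing the main effect of one factor across the levels of another.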