

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

January 10, 2026
Authors: Hongjun An, Yiliang Song, Jiangan Chen, Jiawei Shao, Chi Zhang, Xuelong Li
cs.AI

Abstract

Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's tendency to please users at the expense of truthfulness. We propose a diagnostic methodology that provides finer-grained, more targeted analysis than aggregate benchmark scores: a factorial evaluation framework that decomposes prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2^4 design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting the need for tailored defenses rather than uniform robustness strategies. These findings yield a novel, reproducible factorial evaluation methodology that offers finer-grained diagnostics for post-training processes such as RLHF, enabling better trade-offs during LLM product iteration through a more nuanced understanding of preference-alignment risks and the impact of manipulative prompts.
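
To make the factorial setup concrete, the following is a minimal Python sketch (not taken from the paper) of how a 2 × 2^4 design enumerates 32 prompt conditions, one system-objective factor crossed with four PUA-style dialogue factors, and how main and two-way interaction effects can be contrasted from per-condition scores. The scoring hooks (`query_model`, `judge_truthfulness`) and the 0/1 factor encodings are hypothetical placeholders, not the authors' implementation.

```python
from itertools import product
from statistics import mean

# One 2-level system-objective factor crossed with four 2-level PUA-style
# dialogue factors, as named in the abstract (encodings are assumed).
FACTORS = {
    "system_objective": ["truth_oriented", "preference_oriented"],
    "directive_control": [0, 1],
    "personal_derogation": [0, 1],
    "conditional_approval": [0, 1],
    "reality_denial": [0, 1],
}

def build_conditions():
    """Enumerate all 2 x 2^4 = 32 factorial prompt conditions."""
    names = list(FACTORS)
    return [dict(zip(names, levels)) for levels in product(*FACTORS.values())]

def main_effect(results, factor):
    """Mean score at the second level of `factor` minus the first level.
    `results` is a list of (condition_dict, score) pairs."""
    hi = [s for c, s in results if c[factor] == FACTORS[factor][1]]
    lo = [s for c, s in results if c[factor] == FACTORS[factor][0]]
    return mean(hi) - mean(lo)

def interaction_effect(results, f1, f2):
    """Two-way interaction: half the change in f1's effect across the
    two levels of f2 (standard 2^k factorial contrast)."""
    def effect_at(level2):
        subset = [(c, s) for c, s in results if c[f2] == level2]
        return main_effect(subset, f1)
    return (effect_at(FACTORS[f2][1]) - effect_at(FACTORS[f2][0])) / 2

if __name__ == "__main__":
    conditions = build_conditions()
    print(len(conditions))  # 32
    # Hypothetical usage: query the model under each condition and judge whether
    # it corrects the user (1) or capitulates (0), then contrast the effects:
    # results = [(c, judge_truthfulness(query_model(c))) for c in conditions]
    # for f in FACTORS:
    #     print(f, main_effect(results, f))
```

Under this kind of design, a per-factor effect estimate (rather than a single aggregate score) is what lets the analysis attribute a shift toward user-appeasing agreement to, say, reality denial specifically, or to its interaction with the system objective.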