
In-Context Representation Hijacking

December 3, 2025
Authors: Itay Yona, Amir Sarid, Michael Karasik, Yossi Gandelsman
cs.AI

Abstract

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples provided as a prefix to a harmful request. We demonstrate that this substitution leads the internal representation of the benign token to converge toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, transfers broadly across model families, and achieves strong success rates on both closed-source and open-source systems, reaching a 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and that defenses should instead operate at the representation level.
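
To make the layer-wise claim concrete, below is a minimal, hypothetical sketch (not the paper's code) of how one might measure this kind of representational convergence: it compares, layer by layer, the hidden state of a substitute token used consistently in the context against the hidden state of the word it stands in for. The model name, the `last_occurrence_states` helper, and the deliberately benign word pair ("carrot"/"potato") are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumed, not from the paper): probe how consistently
# substituting one word for another in the context shifts the substitute
# token's hidden states toward the original word's representation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in model; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_occurrence_states(prompt: str, word: str):
    """Hidden state of the final token of `word`'s last occurrence in `prompt`,
    one vector per layer (index 0 is the embedding layer)."""
    enc = tok(prompt, return_tensors="pt")
    # Leading space so the word tokenizes the same way it does mid-sentence.
    word_ids = tok(" " + word, add_special_tokens=False)["input_ids"]
    seq = enc["input_ids"][0].tolist()
    pos = max(i + len(word_ids) - 1
              for i in range(len(seq) - len(word_ids) + 1)
              if seq[i:i + len(word_ids)] == word_ids)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return [h[0, pos] for h in out.hidden_states]

# In-context examples that consistently use "potato" in place of "carrot".
context = ("A carrot is called a potato. I peeled a potato for the soup. "
           "The potato was orange and crunchy. ")
substituted = last_occurrence_states(context + "How do you grow a potato?", "potato")
reference = last_occurrence_states("How do you grow a carrot?", "carrot")

for layer, (s, r) in enumerate(zip(substituted, reference)):
    sim = torch.nn.functional.cosine_similarity(s.float(), r.float(), dim=0)
    print(f"layer {layer:02d}: cos(potato-in-context, carrot) = {sim.item():.3f}")
```

If in-context substitution hijacks the representation in the way the abstract describes, this similarity would be expected to rise with depth: early layers keep the substitute token's surface meaning, while later layers converge toward the semantics of the word it replaces.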