In-Context Representation Hijacking

Abstract

We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack systematically replaces a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, which are supplied as a prefix to a harmful request. We demonstrate that this substitution causes the internal representation of the benign token to converge toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. Using interpretability tools, we show that this semantic overwrite emerges layer by layer: the benign meaning present in early layers converges to the harmful meaning in later ones. Doublespeak is optimization-free, transfers broadly across model families, and achieves strong success rates on both closed-source and open-source systems, reaching a 74% attack success rate (ASR) on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
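The layer-by-layer convergence described above can be quantified in several ways. The following is a minimal sketch of one simple option, per-layer cosine similarity between two tokens' hidden states. It is not necessarily the interpretability method used in this work. The function name `layerwise_cosine`, the array shapes, and the synthetic data are all assumptions made for illustration; in practice the two arrays would hold hidden states extracted from a model.

```python
import numpy as np


def layerwise_cosine(benign_states, harmful_states):
    """Cosine similarity per layer between two (num_layers, hidden_dim) arrays.

    Hypothetical helper: row i of each array is assumed to be the hidden state
    of one token at layer i.
    """
    benign = np.asarray(benign_states, dtype=float)
    harmful = np.asarray(harmful_states, dtype=float)
    if benign.ndim != 2 or benign.shape != harmful.shape:
        raise ValueError("expected two arrays of identical shape (num_layers, hidden_dim)")
    dots = np.sum(benign * harmful, axis=1)
    norms = np.linalg.norm(benign, axis=1) * np.linalg.norm(harmful, axis=1)
    # Guard against division by zero for all-zero rows.
    return dots / np.maximum(norms, 1e-12)


# Synthetic stand-in data (an assumption, not real model activations):
# the "benign" token drifts linearly from its own direction toward the
# "harmful" direction as depth increases.
rng = np.random.default_rng(0)
num_layers, hidden_dim = 8, 64
harmful_dir = rng.normal(size=hidden_dim)
benign_dir = rng.normal(size=hidden_dim)
mix = np.linspace(0.0, 1.0, num_layers)  # 0 = purely benign, 1 = purely harmful

benign_token_states = (1 - mix)[:, None] * benign_dir + mix[:, None] * harmful_dir
harmful_token_states = np.tile(harmful_dir, (num_layers, 1))

similarities = layerwise_cosine(benign_token_states, harmful_token_states)
for layer, sim in enumerate(similarities):
    print(f"layer {layer}: cosine similarity = {sim:.3f}")
```

On the synthetic data the similarity rises with depth by construction. With real activations, a similar upward trend would be consistent with the benign token's representation converging toward the harmful one.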
