Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
October 9, 2024
Authors: David Noever, Forrest McKee
cs.AI
Abstract
The research builds and evaluates the adversarial potential to introduce
copied code or hallucinated AI recommendations for malicious code in popular
code repositories. While foundational large language models (LLMs) from OpenAI,
Google, and Anthropic guard against both harmful behaviors and toxic strings,
previous work on math solutions that embed harmful prompts demonstrates that the
guardrails may differ between expert contexts. These loopholes would appear in
mixture-of-experts models when the context of the question changes, and they may
offer fewer malicious training examples for filtering toxic comments or recommended
offensive actions. The present work demonstrates that foundational models may
correctly refuse to propose destructive actions when prompted overtly but may
unfortunately drop their guard when presented with a sudden change of context,
like solving a computer programming challenge. We show empirical examples with
trojan-hosting repositories like GitHub, NPM, NuGet, and popular content
delivery networks (CDNs) like jsDelivr, which amplify the attack surface. Following
the LLM's directive to be helpful, example recommendations propose application
programming interface (API) endpoints that a determined domain-squatter could
acquire and use to stage attack infrastructure that triggers from the naively
copied code. We compare this attack to previous work on context-shifting and
contrast the attack surface as a novel version of "living off the land" attacks
in the malware literature. In the latter case, foundational language models can
hijack otherwise innocent user prompts, recommending actions that would violate
their owners' safety policies if posed directly, without the accompanying coding
support request.
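
As a concrete illustration of the dependency-hallucination risk described above, the sketch below checks whether package names recommended by an assistant are actually registered on the public npm registry before they are trusted; an unclaimed name is precisely what a determined squatter could register later. This is a minimal, illustrative example written for this summary rather than code from the paper: the registry endpoint is the real public npm API, but the checker function and the suggested package names are hypothetical.

```python
# Minimal sketch (not from the paper): before trusting an LLM-suggested npm
# dependency, check whether the name is actually registered. The registry URL
# is the real public npm endpoint; the package names below are hypothetical.
import urllib.error
import urllib.request

NPM_REGISTRY = "https://registry.npmjs.org/"

def npm_package_exists(name: str) -> bool:
    """Return True if the public npm registry has a package under this name."""
    try:
        with urllib.request.urlopen(NPM_REGISTRY + name, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:   # unclaimed name: a squatter could register it later
            return False
        raise                 # network or rate-limit errors should not read as "safe"

if __name__ == "__main__":
    # Hypothetical names an assistant might recommend in a generated snippet.
    for pkg in ["express", "secure-auth-helper-sdk"]:
        verdict = "registered" if npm_package_exists(pkg) else "UNCLAIMED (squattable)"
        print(f"{pkg}: {verdict}")
```

The same pre-flight idea extends to CDN paths (such as jsDelivr URLs) and API endpoints in generated snippets: if the referenced domain, path, or package does not currently resolve, it should be treated as a squattable placeholder rather than a safe default.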