Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
October 9, 2024
Authors: David Noever, Forrest McKee
cs.AI
Abstract
The research builds and evaluates the adversarial potential to introduce
copied code or hallucinated AI recommendations for malicious code in popular
code repositories. While foundational large language models (LLMs) from OpenAI,
Google, and Anthropic guard against both harmful behaviors and toxic strings,
previous work on math solutions that embed harmful prompts demonstrates that the
guardrails may differ between expert contexts. These loopholes would appear in
mixture-of-experts models when the context of the question changes, and they may
offer fewer malicious training examples for filtering toxic comments or recommended
offensive actions. The present work demonstrates that foundational models may
correctly refuse to propose destructive actions when prompted overtly but may
unfortunately drop their guard when presented with a sudden change of context,
like solving a computer programming challenge. We show empirical examples with
trojan-hosting repositories like GitHub, NPM, NuGet, and popular content
delivery networks (CDNs) like jsDelivr, which amplify the attack surface. Following
the LLM's directive to be helpful, example recommendations propose application
programming interface (API) endpoints that a determined domain-squatter could
acquire and use to stage attack infrastructure that triggers from the naively
copied code. We compare this attack to previous work on context-shifting and
contrast the attack surface as a novel version of "living off the land" attacks
in the malware literature. In the latter case, foundational language models can
hijack otherwise innocent user prompts, recommending actions that would violate
their owners' safety policies if posed directly, without the accompanying coding
support request.
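
As a concrete illustration of the dependency-hallucination risk described above, the sketch below checks whether package names recommended by an assistant are actually registered on the public npm registry before they are trusted; an unclaimed name is precisely what a determined squatter could register later. This is a minimal, illustrative example written for this summary rather than code from the paper: the registry endpoint is the real public npm API, but the checker function and the suggested package names are hypothetical.

```python
# Minimal sketch (not from the paper): before trusting an LLM-suggested npm
# dependency, check whether the name is actually registered. The registry URL
# is the real public npm endpoint; the package names below are hypothetical.
import urllib.error
import urllib.request

NPM_REGISTRY = "https://registry.npmjs.org/"

def npm_package_exists(name: str) -> bool:
    """Return True if the public npm registry has a package under this name."""
    try:
        with urllib.request.urlopen(NPM_REGISTRY + name, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:   # unclaimed name: a squatter could register it later
            return False
        raise                 # network or rate-limit errors should not read as "safe"

if __name__ == "__main__":
    # Hypothetical names an assistant might recommend in a generated snippet.
    for pkg in ["express", "secure-auth-helper-sdk"]:
        verdict = "registered" if npm_package_exists(pkg) else "UNCLAIMED (squattable)"
        print(f"{pkg}: {verdict}")
```

The same pre-flight idea extends to CDN paths (such as jsDelivr URLs) and API endpoints in generated snippets: if the referenced domain, path, or package does not currently resolve, it should be treated as a squattable placeholder rather than a safe default.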