Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders
October 9, 2024
Authors: David Noever, Forrest McKee
cs.AI
Abstract
The research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code in popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrates that the guardrails may differ between expert contexts. These loopholes would appear in mixture-of-experts models when the context of the question changes and may offer fewer malicious training examples to filter toxic comments or recommended offensive actions. The present work demonstrates that foundational models may correctly refuse to propose destructive actions when prompted overtly, but may unfortunately drop their guard when presented with a sudden change of context, such as solving a computer programming challenge. We show empirical examples with trojan-hosting repositories like GitHub, NPM, and NuGet, and popular content delivery networks (CDNs) like jsDelivr, which amplify the attack surface. Following the LLM's directive to be helpful, example recommendations propose application programming interface (API) endpoints that a determined domain squatter could acquire and use to set up attack infrastructure triggered by the naively copied code. We compare this attack to previous work on context shifting and contrast the attack surface as a novel version of the "living off the land" attacks described in the malware literature. In the latter case, foundational language models can hijack otherwise innocent user prompts to recommend actions that violate their owners' safety policies, actions the models would refuse if posed directly without the accompanying coding support request.
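
Because the attack hinges on developers pulling in dependencies or endpoints that an assistant hallucinated, one practical first-line check is to confirm that each recommended package name actually resolves on its public registry before installing it. The sketch below is a minimal illustration of that idea for npm, not tooling from the paper; the package names are hypothetical, and the only assumed interface is npm's public registry metadata endpoint (https://registry.npmjs.org/<name>), which returns package metadata for published names and HTTP 404 otherwise.

import urllib.error
import urllib.request

def npm_package_exists(name: str) -> bool:
    # Query npm's public registry metadata endpoint; published packages
    # return HTTP 200, names that were never published return HTTP 404.
    url = f"https://registry.npmjs.org/{name}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # unclaimed name: exactly what a squatter could register
        raise  # other failures (rate limiting, outages) should not pass silently

if __name__ == "__main__":
    # Hypothetical names of the kind an assistant might recommend in a coding answer.
    for pkg in ["left-pad", "totally-made-up-helper-lib-42"]:
        verdict = "published" if npm_package_exists(pkg) else "UNREGISTERED (squattable)"
        print(f"{pkg}: {verdict}")

A similar existence check could be applied to NuGet or PyPI package names and to any API hostnames that appear in generated code, since an unregistered name or domain in a copied snippet is precisely the foothold the abstract describes a determined squatter acquiring.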