Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders

Abstract

This research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code in popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrates that the guardrails may differ between expert contexts. These loopholes would appear in mixture-of-experts models when the context of the question changes, where fewer malicious training examples may be available for filtering toxic comments or recommended offensive actions. The present work demonstrates that foundational models may correctly refuse to propose destructive actions when prompted overtly, but may drop their guard when presented with a sudden change of context, such as solving a computer programming challenge. We show empirical examples with trojan-hosting repositories like GitHub, NPM, and NuGet, and with popular content delivery networks (CDNs) like jsDelivr, which amplify the attack surface. Because LLMs are directed to be helpful, example recommendations propose application programming interface (API) endpoints that a determined domain-squatter could acquire in order to set up mobile attack infrastructure that is triggered by the naively copied code. We compare this attack to previous work on context-shifting and contrast the attack surface as a novel version of "living off the land" attacks in the malware literature. In the latter case, foundational language models can hijack otherwise innocent user prompts to recommend actions that would violate their owners' safety policies if requested directly, without the accompanying coding support request.
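The risk described above, naively copied code that calls an endpoint nobody has vetted, suggests a simple pre-copy check. The following is a minimal defensive sketch and not a method from the paper: it extracts the hostnames referenced in an LLM-suggested snippet and flags any that are missing from a reviewed allowlist. The names `TRUSTED_HOSTS`, `extract_hosts`, and `unvetted_hosts`, the allowlist contents, and the example URLs are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist of hosts a team has already reviewed (illustrative only).
TRUSTED_HOSTS = {"api.github.com", "registry.npmjs.org"}

# Matches http(s) URLs up to the first whitespace, quote, bracket, or closing parenthesis.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def extract_hosts(snippet):
    """Return the lowercase hostnames of all http(s) URLs found in a code snippet."""
    hosts = set()
    for url in URL_PATTERN.findall(snippet):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def unvetted_hosts(snippet, trusted=TRUSTED_HOSTS):
    """Return hosts referenced by the snippet that are not on the allowlist."""
    return extract_hosts(snippet) - set(trusted)


if __name__ == "__main__":
    # Hypothetical LLM-suggested code; the second URL stands in for a hallucinated endpoint.
    suggested = '''
    fetch("https://api.github.com/repos/octocat/hello-world")
    fetch("https://api.example-weather-data.invalid/v1/forecast")
    '''
    for host in sorted(unvetted_hosts(suggested)):
        print("review before copying:", host)
```

A host-level check like this only catches unknown domains. It does not detect a trojan repository or package served from an otherwise trusted host, which is the case the abstract raises for GitHub, NPM, NuGet, and jsDelivr, so it would need to be combined with package-level review.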
