Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders

Abstract

This research builds and evaluates the adversarial potential of introducing copied code or hallucinated AI recommendations for malicious code into popular code repositories. While foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings, previous work on math solutions that embed harmful prompts demonstrates that the guardrails may differ between expert contexts. Such loopholes can appear in mixture-of-experts models when the context of the question changes, and the new context may offer fewer malicious training examples with which to filter toxic comments or recommended offensive actions. The present work demonstrates that foundational models may correctly refuse to propose destructive actions when prompted overtly, but may drop their guard when presented with a sudden change of context, such as solving a computer programming challenge. We show empirical examples with trojan-hosting repositories such as GitHub, NPM, and NuGet, and with popular content delivery networks (CDNs) such as jsDelivr, which amplify the attack surface. Following the LLM's directive to be helpful, example recommendations propose application programming interface (API) endpoints that a determined domain squatter could acquire in order to set up mobile attack infrastructure that is triggered by the naively copied code. We compare this attack to previous work on context shifting and characterize the attack surface as a novel version of "living off the land" attacks in the malware literature. In the latter case, foundational language models can hijack otherwise innocent user prompts to recommend actions that would violate their owners' safety policies if requested directly, without the accompanying coding support request.
