Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but their safety-related training data consists mainly of high-resource languages such as English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show that surprisingly strong attacks can be created cheaply by altering just a few characters and using a small proxy model to calculate word importance. We find that these character- and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how the methodology can be extended to other languages. We release the created datasets and code for further research.
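The attack recipe summarized above (rank words by importance with a small proxy model, then alter a few characters in the top-ranked words) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' released code: leave-one-out scoring stands in for the importance calculation, an adjacent-character swap stands in for the perturbation, and `toy_proxy_score` is a hypothetical keyword scorer standing in for a real proxy classifier.

```python
import random
from typing import Callable, List


def word_importance(words: List[str], score: Callable[[str], float]) -> List[float]:
    """Leave-one-out importance: drop in proxy score when a word is removed."""
    base = score(" ".join(words))
    return [
        base - score(" ".join(words[:i] + words[i + 1:]))
        for i in range(len(words))
    ]


def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Swap two neighbouring inner characters; first and last stay in place."""
    if len(word) < 4:
        return word  # too short to perturb without touching the edges
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def perturb(text: str, score: Callable[[str], float], k: int = 1, seed: int = 0) -> str:
    """Perturb the k words that the proxy scorer ranks as most important."""
    rng = random.Random(seed)
    words = text.split()
    importance = word_importance(words, score)
    # Indices of the k most important words (ties keep original order).
    top = sorted(range(len(words)), key=lambda i: -importance[i])[:k]
    for i in top:
        words[i] = swap_adjacent_chars(words[i], rng)
    return " ".join(words)


def toy_proxy_score(text: str) -> float:
    """Hypothetical proxy: counts negative cue words (assumption, not a real model)."""
    cues = {"okropny", "fatalny"}
    return float(sum(token in cues for token in text.lower().split()))


if __name__ == "__main__":
    original = "ten film był okropny"  # Polish: "this film was terrible"
    attacked = perturb(original, toy_proxy_score, k=1)
    print(original, "->", attacked)
    print("proxy score:", toy_proxy_score(original), "->", toy_proxy_score(attacked))
```

In practice the scorer would be a small classifier's confidence for the original label, and the perturbed text would then be sent to the larger target LLM. The scorer is passed in as a plain callable so that any proxy can be swapped in without changing the attack loop.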
