Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across a range of natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations calls for additional evaluation. Many LLMs are multilingual, but their safety-related training data consists mainly of high-resource languages such as English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show that surprisingly strong attacks can be created cheaply by altering just a few characters and using a small proxy model to compute word importance. We find that these character-level and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that could be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how the methodology can be extended to other languages. We release the created datasets and code for further research.
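
To make the attack recipe summarized in the abstract more concrete, the following is a minimal sketch of one way such a pipeline could look. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how word importance is computed or which character edits are used, so leave-one-out scoring and an adjacent-character swap are assumed here. The function `toy_proxy_score` and its tiny Polish lexicon are hypothetical stand-ins for a small trained proxy model.

```python
from typing import Callable, List


def toy_proxy_score(text: str) -> float:
    """Hypothetical stand-in for a small proxy model.

    Counts words found in a tiny negative-sentiment lexicon.
    A real setup would query a small trained classifier instead.
    """
    lexicon = {"okropny", "zły"}  # assumed toy lexicon, not from the paper
    return float(sum(w in lexicon for w in text.lower().split()))


def word_importance(words: List[str], score: Callable[[str], float]) -> List[float]:
    """Leave-one-out importance: drop in proxy score when a word is removed."""
    base = score(" ".join(words))
    return [
        base - score(" ".join(words[:i] + words[i + 1:]))
        for i in range(len(words))
    ]


def swap_inner_chars(word: str) -> str:
    """Swap the second and third characters; leave short words untouched."""
    if len(word) < 4:
        return word
    return word[0] + word[2] + word[1] + word[3:]


def attack(text: str, score: Callable[[str], float], k: int = 1) -> str:
    """Perturb the k words that the proxy model ranks as most important."""
    words = text.split()
    importance = word_importance(words, score)
    # sorted() is stable, so ties keep their original left-to-right order.
    top = sorted(range(len(words)), key=lambda i: importance[i], reverse=True)[:k]
    for i in top:
        words[i] = swap_inner_chars(words[i])
    return " ".join(words)
```

Calling `attack("ten film jest okropny", toy_proxy_score, k=1)` perturbs only the word the toy proxy ranks highest, which illustrates why the approach is cheap: the target LLM is never queried while the attack is being constructed, only the small proxy is.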
