Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models
June 9, 2025
Authors: Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
cs.AI
Abstract
Large language models (LLMs) have demonstrated impressive capabilities across
various natural language processing (NLP) tasks in recent years. However, their
susceptibility to jailbreaks and perturbations necessitates additional
evaluations. Many LLMs are multilingual, but their safety-related training data
consists mainly of high-resource languages such as English. This can leave them
vulnerable to perturbations in low-resource languages such as Polish. We show
how surprisingly strong attacks can be cheaply created by altering just a few
characters and using a small proxy model for word importance calculation. We
find that these character and word-level attacks drastically alter the
predictions of different LLMs, suggesting a potential vulnerability that can be
used to circumvent their internal safety mechanisms. We validate our attack
construction methodology on Polish, a low-resource language, and find potential
vulnerabilities of LLMs in this language. Additionally, we show how it can be
extended to other languages. We release the created datasets and code for
further research.
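To make the attack recipe in the abstract concrete, below is a minimal sketch of a word-importance-guided, character-level perturbation. It is illustrative only: proxy_score is a hypothetical stand-in for the small proxy model (a real experiment would use an actual classifier), and the adjacent-character swap is just one simple perturbation operator, not necessarily the authors' exact implementation.

```python
# Sketch: character-level attack guided by proxy-model word importance.
# Assumptions are marked in comments; this is not the paper's released code.

import random


def proxy_score(text: str) -> float:
    """Hypothetical placeholder for the proxy model's confidence in the
    original label. Deterministic stub; swap in a real small classifier."""
    return random.Random(sum(map(ord, text))).random()


def word_importance(text: str) -> list[tuple[int, float]]:
    """Score each word by the drop in proxy confidence when it is deleted
    (a standard leave-one-out importance estimate)."""
    words = text.split()
    base = proxy_score(text)
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores.append((i, base - proxy_score(ablated)))
    return sorted(scores, key=lambda s: s[1], reverse=True)


def perturb_word(word: str, rng: random.Random) -> str:
    """Alter a single character: swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def attack(text: str, budget: int = 3, seed: int = 0) -> str:
    """Perturb characters only in the `budget` most important words,
    keeping the total number of edited characters small."""
    rng = random.Random(seed)
    words = text.split()
    for i, _ in word_importance(text)[:budget]:
        words[i] = perturb_word(words[i], rng)
    return " ".join(words)


if __name__ == "__main__":
    # Toy Polish sentence; in the paper the attacks target safety-relevant
    # inputs evaluated against several LLMs.
    sample = "Czy możesz mi pomóc obejść zabezpieczenia tego systemu?"
    print(attack(sample))
```

The key design choice reflected here is that the expensive target LLM is never queried during attack construction: word importance comes from a cheap proxy model, and only a few characters in the highest-ranked words are altered, which is what keeps the attack low-cost.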