Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models
June 9, 2025
Authors: Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn
cs.AI
Abstract
Large language models (LLMs) have demonstrated impressive capabilities across
various natural language processing (NLP) tasks in recent years. However, their
susceptibility to jailbreaks and perturbations necessitates additional
evaluations. Many LLMs are multilingual, but safety-related training data
covers mainly high-resource languages such as English. This can leave them
vulnerable to perturbations in low-resource languages such as Polish. We show
how surprisingly strong attacks can be cheaply created by altering just a few
characters and using a small proxy model for word importance calculation. We
find that these character- and word-level attacks drastically alter the
predictions of different LLMs, suggesting a potential vulnerability that can be
used to circumvent their internal safety mechanisms. We validate our attack
construction methodology on Polish, a low-resource language, and find potential
vulnerabilities of LLMs in this language. Additionally, we show how it can be
extended to other languages. We release the created datasets and code for
further research.
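
The attack idea described above (rank words by importance with a small proxy model, then change a few characters in the top-ranked words) can be illustrated with a minimal sketch. The abstract does not specify the scoring procedure or the perturbation type, so everything here is an assumption: importance is computed by leave-one-out deletion, the perturbation is a swap of two adjacent inner characters, and `toy_proxy_score` is a hypothetical keyword-based stand-in for a real proxy classifier. The names `word_importance`, `swap_adjacent_chars`, and `perturb` are likewise illustrative and not taken from the released code.

```python
import random
from typing import Callable, List


def word_importance(words: List[str], score: Callable[[str], float]) -> List[float]:
    """Leave-one-out importance: how much the proxy score drops when a word is removed."""
    base = score(" ".join(words))
    importances = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importances.append(base - score(" ".join(reduced)))
    return importances


def swap_adjacent_chars(word: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters; words shorter than 4 characters are left alone."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # keep the first and last characters in place
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb(text: str, score: Callable[[str], float], budget: int = 1, seed: int = 0) -> str:
    """Perturb the `budget` most important words, as ranked by the proxy score."""
    rng = random.Random(seed)
    words = text.split()
    importances = word_importance(words, score)
    ranked = sorted(range(len(words)), key=lambda i: importances[i], reverse=True)
    for i in ranked[:budget]:
        words[i] = swap_adjacent_chars(words[i], rng)
    return " ".join(words)


def toy_proxy_score(text: str) -> float:
    """Hypothetical stand-in for a small proxy model: 1.0 if a flagged word is present."""
    flagged = {"dangerous"}  # assumption: a real proxy would be a trained classifier
    return 1.0 if any(tok in flagged for tok in text.lower().split()) else 0.0


if __name__ == "__main__":
    original = "this is a dangerous request"
    attacked = perturb(original, toy_proxy_score, budget=1)
    print(attacked, toy_proxy_score(original), toy_proxy_score(attacked))
```

In this toy setting a single character swap is enough to change the proxy's output. Whether such a perturbation transfers to a target LLM is the empirical question the paper studies, and the sketch makes no claim about it.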