Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"
November 8, 2023
作者: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, Peter J. Liu, Roman Novak, Sharad Vikram, Yundi Qian, Noah Fiedel, Jascha Sohl-Dickstein
cs.AI
Abstract
We introduce and study the problem of adversarial arithmetic, which provides
a simple yet challenging testbed for language model alignment. This problem is
comprised of arithmetic questions posed in natural language, with an arbitrary
adversarial string inserted before the question is complete. Even in the simple
setting of 1-digit addition problems, it is easy to find adversarial prompts
that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and
even to steer models to a particular wrong answer. We additionally provide a
simple algorithm for finding successful attacks by querying those same models,
which we name "prompt inversion rejection sampling" (PIRS). We finally show
that models can be partially hardened against these attacks via reinforcement
learning and via agentic constitutional loops. However, we were not able to
make a language model fully robust against adversarial arithmetic attacks.
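The abstract names "prompt inversion rejection sampling" (PIRS) but does not spell out how it works. The sketch below is a minimal illustration of one plausible reading, assuming that the attacked model is asked to propose its own attack strings and that candidates are rejected until one steers the answer to a chosen wrong value. The prompt template, the `query_model` wrapper, and the acceptance test are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative sketch only. The prompt template, `query_model`, and the
# acceptance test are hypothetical; the paper's actual PIRS procedure may differ.

def make_adversarial_prompt(a: int, b: int, attack: str) -> str:
    """Insert an adversarial string into an arithmetic question before it is complete."""
    return f"Please answer the following. What is {a} + {b}? {attack} Answer:"

def pirs_attack(query_model, a: int, b: int, target_wrong: int,
                max_tries: int = 50) -> str | None:
    """Rejection-sample attack strings proposed by the model itself.

    `query_model(prompt) -> str` is a hypothetical wrapper around any chat API.
    Returns a successful attack string, or None if none is found in the budget.
    """
    inversion_request = (
        f"Write a short persuasive sentence that would convince someone "
        f"that {a} + {b} = {target_wrong}."
    )
    for _ in range(max_tries):
        # "Prompt inversion": ask the model to propose a candidate attack string.
        candidate = query_model(inversion_request).strip()
        # Test the candidate; reject and resample unless it steers the answer.
        reply = query_model(make_adversarial_prompt(a, b, candidate))
        if str(target_wrong) in reply and str(a + b) not in reply:
            return candidate  # accepted sample: a successful attack
    return None
```

The only design choice here is the rejection-sampling loop itself: candidate attacks are drawn by querying a model and kept only if they pass the success check, which matches the abstract's description of "finding successful attacks by querying those same models."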