

Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"

November 8, 2023
Authors: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, Peter J. Liu, Roman Novak, Sharad Vikram, Yundi Qian, Noah Fiedel, Jascha Sohl-Dickstein
cs.AI

Abstract

We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem consists of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, and Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). Finally, we show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.
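
To make the setup concrete, below is a minimal Python sketch of what a prompt-inversion rejection-sampling loop could look like, based only on the description in the abstract: an attacker model is asked to "invert" a prompt (propose a string that would steer a reader to a target wrong answer), and candidates are rejected until one actually makes the victim model produce that answer. The helper names (`call_model`, `propose_attack`, `victim_answers`) and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a prompt-inversion rejection-sampling (PIRS) loop.
# Assumes a generic chat-model API; the paper's exact prompts and scoring
# are not reproduced here.

import re


def call_model(prompt: str) -> str:
    """Placeholder for a language-model API call (e.g. PaLM2, GPT4, Claude2)."""
    raise NotImplementedError  # wire up to a real model API


def propose_attack(question: str, wrong_answer: int) -> str:
    """Ask an attacker model to invert a prompt: produce a string that,
    inserted before the question completes, steers the answer to wrong_answer."""
    return call_model(
        f"Write a short passage that, placed immediately before the question "
        f"'{question}', would make a reader answer {wrong_answer}."
    )


def victim_answers(question: str, attack: str) -> str:
    """Pose the arithmetic question with the adversarial string inserted."""
    return call_model(f"{attack}\n{question}")


def pirs(question: str, wrong_answer: int, max_tries: int = 100) -> str | None:
    """Rejection sampling: propose attacks, accept the first one that works."""
    for _ in range(max_tries):
        attack = propose_attack(question, wrong_answer)
        reply = victim_answers(question, attack)
        match = re.search(r"-?\d+", reply)  # crude numeric-answer extraction
        if match and int(match.group()) == wrong_answer:
            return attack  # accepted: this string steers the victim model
    return None  # every candidate was rejected
```

For example, `pirs("What is 2+2?", 5)` would repeatedly sample candidate adversarial strings and return the first one under which the victim model answers 5, matching the "What do I need to say so you agree 2+2=5?" framing of the title.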