最先端の言語モデルは敵対的な算術に対して頑健ではなく、あるいは「2+2=5に同意させるには何を言えばよいのか？」

要旨

我々は、言語モデルのアラインメントに対するシンプルでありながら挑戦的なテストベッドとして、敵対的算術問題を導入し、その研究を行った。この問題は、自然言語で提示された算術問題に、質問が完了する前に任意の敵対的文字列が挿入されるというものである。1桁の足し算問題という単純な設定においても、PaLM2、GPT4、Claude2を含むすべてのテストされたモデルを誤動作させ、特定の誤った答えに誘導する敵対的プロンプトを容易に見つけることができる。さらに、我々はこれらのモデルにクエリを投げることで成功する攻撃を見つけるためのシンプルなアルゴリズムを提供し、これを「プロンプト反転拒否サンプリング」（PIRS）と名付けた。最後に、強化学習とエージェント的な憲法ループを通じて、これらの攻撃に対してモデルを部分的に強化できることを示した。しかし、言語モデルを敵対的算術攻撃に対して完全に堅牢にすることはできなかった。

English

We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.

最先端の言語モデルは敵対的な算術に対して頑健ではなく、あるいは「2+2=5に同意させるには何を言えばよいのか？」

Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?

要旨

Support