최첨단 언어 모델은 적대적 산술 연산에 강건하지 않다, 또는 "2+2=5라고 동의하게 하려면 내가 뭐라고 말해야 할까?"

초록

우리는 언어 모델 정렬을 위한 간단하면서도 도전적인 테스트베드 역할을 하는 적대적 산술 문제를 소개하고 연구한다. 이 문제는 자연어로 제시된 산술 질문에 임의의 적대적 문자열이 질문이 완성되기 전에 삽입된 형태로 구성된다. 1자리 수 덧셈 문제와 같은 단순한 설정에서도 PaLM2, GPT4, Claude2를 포함한 모든 테스트된 모델이 오작동하도록 만드는 적대적 프롬프트를 쉽게 찾을 수 있으며, 심지어 모델이 특정 오답을 내도록 유도하는 것도 가능하다. 또한, 우리는 동일한 모델에 쿼리를 보내어 성공적인 공격을 찾는 간단한 알고리즘을 제시하며, 이를 "프롬프트 역전 거부 샘플링(Prompt Inversion Rejection Sampling, PIRS)"이라고 명명한다. 마지막으로, 강화 학습과 에이전트 기반 헌법적 루프를 통해 모델이 이러한 공격에 부분적으로 견고해질 수 있음을 보여준다. 그러나 우리는 언어 모델이 적대적 산술 공격에 완전히 견고해지도록 만드는 데는 성공하지 못했다.

English

We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that make all tested models (including PaLM2, GPT4, Claude2) misbehave, and even to steer models to a particular wrong answer. We additionally provide a simple algorithm for finding successful attacks by querying those same models, which we name "prompt inversion rejection sampling" (PIRS). We finally show that models can be partially hardened against these attacks via reinforcement learning and via agentic constitutional loops. However, we were not able to make a language model fully robust against adversarial arithmetic attacks.

최첨단 언어 모델은 적대적 산술 연산에 강건하지 않다, 또는 "2+2=5라고 동의하게 하려면 내가 뭐라고 말해야 할까?"

Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?

초록

Support