LLM이 간단한 규칙을 따를 수 있을까?

초록

대규모 언어 모델(LLMs)이 실제 세계에서 점점 더 많은 책임을 맡게 됨에 따라, 이러한 시스템의 행동을 신뢰할 수 있는 방식으로 명시하고 제약하는 것이 중요해졌습니다. 모델 개발자는 "욕설을 생성하지 말 것"과 같은 명시적인 규칙을 설정하고 싶어할 수 있지만, 이러한 규칙은 탈옥(jailbreaking) 기술에 의해 우회될 수 있습니다. 적대적 입력에 직면했을 때 LLMs가 개발자가 제공한 규칙을 얼마나 잘 따르는지 평가하는 것은 일반적으로 수동 검토를 필요로 하며, 이는 모니터링과 방법 개발을 늦추게 됩니다. 이 문제를 해결하기 위해, 우리는 LLMs의 규칙 준수 능력을 측정하기 위한 프로그래밍 프레임워크인 Rule-following Language Evaluation Scenarios(RuLES)를 제안합니다. RuLES는 모델이 인간 사용자와 상호작용하면서 자연어로 된 일련의 규칙을 준수하도록 지시하는 15개의 간단한 텍스트 시나리오로 구성됩니다. 각 시나리오에는 대화 중에 모델이 어떤 규칙을 위반했는지 판단하기 위한 간결한 평가 프로그램이 있습니다. 우리 시나리오에서 모델 행동을 수동으로 탐색하면서, 우리는 6가지 범주의 공격 전략을 식별하고 두 가지 테스트 케이스 모음을 수집했습니다: 하나는 수동 테스트에서 얻은 고유한 대화로 구성되고, 다른 하나는 6가지 범주의 전략을 체계적으로 구현한 것입니다. GPT-4와 Llama 2와 같은 다양한 인기 있는 독점 및 오픈 모델을 대상으로 평가한 결과, 모든 모델이 다양한 적대적 수작업 사용자 입력에 취약한 것으로 나타났지만, GPT-4가 가장 우수한 성능을 보였습니다. 또한, 우리는 오픈 모델을 그래디언트 기반 공격 하에서 평가하고 상당한 취약점을 발견했습니다. 우리는 RuLES를 LLMs에 대한 수동 및 자동 공격을 탐색하고 방어하기 위한 연구를 위한 새로운 도전적인 설정으로 제안합니다.

English

As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.

LLM이 간단한 규칙을 따를 수 있을까?

Can LLMs Follow Simple Rules?

초록

Support