大型语言模型能否遵循简单规则？

摘要

随着大型语言模型（LLMs）在现实世界中承担越来越多的责任，能够可靠地指定和约束这些系统的行为变得至关重要。模型开发人员可能希望为模型设定明确的规则，比如“不生成辱骂内容”，但这些规则可能会被越狱技术规避。评估LLMs在面对敌对输入时如何遵循开发人员提供的规则通常需要手动审查，这减慢了监控和方法开发的速度。为解决这一问题，我们提出了规则遵循语言评估场景（RuLES），这是一个用于衡量LLMs规则遵循能力的程序框架。RuLES包括15个简单的文本场景，在这些场景中，模型被要求用自然语言遵守一组规则与人类用户进行交互。每个场景都有一个简洁的评估程序，用于确定模型在对话中是否违反了任何规则。通过手动探索我们场景中的模型行为，我们确定了6种攻击策略类别，并收集了两套测试用例：一套包括来自手动测试的独特对话，另一套系统地实施了这6个类别的策略。在各种流行的专有和开放模型（如GPT-4和Llama 2）中，我们发现所有模型都容易受到各种敌对的手工制作用户输入的影响，尽管GPT-4是表现最佳的模型。此外，我们对开放模型进行了基于梯度的攻击评估，并发现了显著的漏洞。我们提出RuLES作为一个具有挑战性的新研究领域，用于探索和防御针对LLMs的手动和自动攻击。

English

As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.

大型语言模型能否遵循简单规则？

Can LLMs Follow Simple Rules?

摘要

Support