LLM 能遵循簡單規則嗎？

摘要

隨著大型語言模型（LLMs）在現實世界中承擔越來越多的責任，能夠可靠地指定和約束這些系統行為變得至關重要。模型開發人員可能希望為模型設定明確的規則，例如「不生成辱罵內容」，但這些規則可能被越獄技術規避。評估LLMs在面對對抗性輸入時如何遵循開發人員提供的規則通常需要手動審查，這會減慢監控和方法開發的速度。為了解決這個問題，我們提出了「遵循規則語言評估場景」（RuLES），這是一個用於測量LLMs遵循規則能力的程序框架。RuLES包括15個簡單的文本場景，在這些場景中，模型被要求用自然語言遵守一組規則與人類用戶互動。每個場景都有一個簡潔的評估程序，用於確定模型在對話中是否違反了任何規則。通過在我們的場景中手動探索模型行為，我們識別了6種攻擊策略類別並收集了兩套測試用例：一套包括手動測試的獨特對話，另一套系統地實施了來自6個類別的策略。在各種流行的專有和開放模型（如GPT-4和Llama 2）中，我們發現所有模型都容易受到各種對抗性手工製作的用戶輸入的影響，儘管GPT-4是表現最佳的模型。此外，我們對開放模型進行了基於梯度的攻擊評估，發現存在顯著的漏洞。我們提出RuLES作為一個具有挑戰性的新研究環境，用於探索和防禦LLMs面臨的手動和自動攻擊。

English

As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.

LLM 能遵循簡單規則嗎？

Can LLMs Follow Simple Rules?

摘要

Support