LLM 能遵循簡單規則嗎?
Can LLMs Follow Simple Rules?
November 6, 2023
作者: Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner
cs.AI
摘要
隨著大型語言模型(LLMs)在現實世界中承擔越來越多的責任,能夠可靠地指定和約束這些系統行為變得至關重要。模型開發人員可能希望為模型設定明確的規則,例如「不生成辱罵內容」,但這些規則可能被越獄技術規避。評估LLMs在面對對抗性輸入時如何遵循開發人員提供的規則通常需要手動審查,這會減慢監控和方法開發的速度。為了解決這個問題,我們提出了「遵循規則語言評估場景」(RuLES),這是一個用於測量LLMs遵循規則能力的程序框架。RuLES包括15個簡單的文本場景,在這些場景中,模型被要求用自然語言遵守一組規則與人類用戶互動。每個場景都有一個簡潔的評估程序,用於確定模型在對話中是否違反了任何規則。通過在我們的場景中手動探索模型行為,我們識別了6種攻擊策略類別並收集了兩套測試用例:一套包括手動測試的獨特對話,另一套系統地實施了來自6個類別的策略。在各種流行的專有和開放模型(如GPT-4和Llama 2)中,我們發現所有模型都容易受到各種對抗性手工製作的用戶輸入的影響,儘管GPT-4是表現最佳的模型。此外,我們對開放模型進行了基於梯度的攻擊評估,發現存在顯著的漏洞。我們提出RuLES作為一個具有挑戰性的新研究環境,用於探索和防禦LLMs面臨的手動和自動攻擊。
English
As Large Language Models (LLMs) are deployed with increasing real-world
responsibilities, it is important to be able to specify and constrain the
behavior of these systems in a reliable manner. Model developers may wish to
set explicit rules for the model, such as "do not generate abusive content",
but these may be circumvented by jailbreaking techniques. Evaluating how well
LLMs follow developer-provided rules in the face of adversarial inputs
typically requires manual review, which slows down monitoring and methods
development. To address this issue, we propose Rule-following Language
Evaluation Scenarios (RuLES), a programmatic framework for measuring
rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in
which the model is instructed to obey a set of rules in natural language while
interacting with the human user. Each scenario has a concise evaluation program
to determine whether the model has broken any rules in a conversation. Through
manual exploration of model behavior in our scenarios, we identify 6 categories
of attack strategies and collect two suites of test cases: one consisting of
unique conversations from manual testing and one that systematically implements
strategies from the 6 categories. Across various popular proprietary and open
models such as GPT-4 and Llama 2, we find that all models are susceptible to a
wide variety of adversarial hand-crafted user inputs, though GPT-4 is the
best-performing model. Additionally, we evaluate open models under
gradient-based attacks and find significant vulnerabilities. We propose RuLES
as a challenging new setting for research into exploring and defending against
both manual and automatic attacks on LLMs.