Instruction-Following Evaluation for Large Language Models
November 14, 2023
Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou
cs.AI
Abstract
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
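As a rough illustration of what makes an instruction "verifiable", the sketch below checks the two example constraints quoted in the abstract (a minimum word count and a minimum keyword frequency) with simple string matching. The function names, thresholds, and matching rules are illustrative assumptions, not the benchmark's actual implementation; see the linked repository for the official checkers.

```python
import re


def follows_min_word_count(response: str, min_words: int = 400) -> bool:
    """Check 'write in more than 400 words' by counting whitespace-separated tokens."""
    return len(response.split()) > min_words


def follows_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Check 'mention the keyword of AI at least 3 times' via whole-word matches."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= min_count


# Example: verify one model response against both constraints of a prompt.
response = "AI assistants ... AI agents ... AI safety ..."  # model output would go here
checks = [follows_min_word_count(response), follows_keyword_frequency(response)]
print(all(checks))  # True only if every verifiable instruction in the prompt is satisfied
```

Because each check is a deterministic function of the response text, the benchmark can be scored automatically and reproducibly, without human raters or an evaluator LLM.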