Instruction-Following Evaluation for Large Language Models

November 14, 2023
Authors: Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou
cs.AI

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
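To make the idea of "verifiable instructions" concrete, the sketch below shows how such constraints can be checked programmatically with simple rule-based functions. The function names and instruction labels are illustrative assumptions, not the actual IFEval implementation; see the linked repository for the authors' code and the full set of 25 instruction types.

```python
import re

def check_min_words(response: str, min_words: int = 400) -> bool:
    """Return True if the response contains at least `min_words` words."""
    return len(response.split()) >= min_words

def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """Return True if `keyword` appears at least `min_count` times (case-insensitive, whole word)."""
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, flags=re.IGNORECASE)
    return len(matches) >= min_count

# Example: verify a model response against two verifiable instructions.
# The label strings below are hypothetical identifiers for illustration only.
response = "AI systems can follow instructions. AI evaluation matters, and AI benchmarks help..."
results = {
    "length_constraints:number_words": check_min_words(response, 400),
    "keywords:frequency": check_keyword_frequency(response, "AI", 3),
}
print(results)  # e.g. {'length_constraints:number_words': False, 'keywords:frequency': True}
```

Because each check is a deterministic rule rather than a human or LLM judgment, the resulting evaluation is objective and reproducible, which is the core motivation behind IFEval.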