Instruction-Following Evaluation for Large Language Models

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of this ability is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of such verifiable instructions and constructed around 500 prompts, each containing one or more verifiable instructions. We report evaluation results for two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
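To make the idea of a "verifiable instruction" concrete, the two examples quoted in the abstract can be checked by simple deterministic code. The sketch below is only an illustration under stated assumptions: the function names (`check_min_words`, `check_keyword_frequency`, `follows_all`) are hypothetical and are not taken from the released IFEval code, and counting whitespace-separated tokens as words is an assumption about how "words" might be measured.

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Return True if the response has strictly more than `min_words` words.

    Assumption: a "word" is any whitespace-separated token.
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` occurs at least `min_count` times.

    Assumption: matching is case-insensitive and restricted to whole words.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """Return True only if every check passes for the response."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    # Hypothetical prompt: "Write in more than 5 words and mention AI at least 2 times."
    response = "AI systems are useful, and AI research keeps moving quickly."
    checks = [
        lambda r: check_min_words(r, 5),
        lambda r: check_keyword_frequency(r, "AI", 2),
    ]
    print(follows_all(response, checks))
```

Because each check is a pure function of the response text, the same response always receives the same verdict, which is the reproducibility property the abstract contrasts with human and LLM-based evaluation.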
