Instruction-Following Evaluation for Large Language Models

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
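To illustrate what makes an instruction "verifiable", the two examples quoted in the abstract can be checked with a few lines of deterministic code. The following is a minimal sketch, not the implementation in the linked repository: the function names (`has_more_than_n_words`, `mentions_keyword_at_least`, `check_response`) are hypothetical, and the word-counting and keyword-matching rules (whitespace splitting, case-insensitive whole-word match) are assumptions made for illustration.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """Check an instruction like "write in more than 400 words".

    Assumption: a "word" is any whitespace-separated token.
    """
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, min_count: int) -> bool:
    """Check an instruction like "mention the keyword of AI at least 3 times".

    Assumption: matches are whole words and case-insensitive.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def check_response(response: str, checks) -> list:
    """Apply every check attached to a prompt and return one bool per instruction."""
    return [check(response) for check in checks]


if __name__ == "__main__":
    # A prompt may carry more than one verifiable instruction.
    checks = [
        lambda r: has_more_than_n_words(r, 400),
        lambda r: mentions_keyword_at_least(r, "AI", 3),
    ]
    response = "AI is useful. AI is fast. AI is everywhere."
    print(check_response(response, checks))  # prints [False, True]
```

Because each check is a pure function of the response text, the result does not depend on a human rater or an evaluator LLM, which is the reproducibility property the abstract emphasizes.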

