

The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

June 28, 2024
Authors: Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
cs.AI

Abstract

Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today's language models.
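To make the verification idea concrete, below is a minimal sketch, not taken from the paper, of how a SIFo-style sequential task could be scored: instructions are issued to the model one at a time within a single dialogue, and correctness is judged from the final response alone. The `model_call` stub, the `run_sequential_task` helper, and the toy text-modification task are hypothetical placeholders, not the authors' evaluation code.

```python
# Illustrative sketch (assumptions, not the SIFo implementation): feed instructions
# sequentially and verify only the output of the final instruction.
from typing import Callable, List


def run_sequential_task(
    model_call: Callable[[List[dict]], str],
    instructions: List[str],
    final_answer_check: Callable[[str], bool],
) -> bool:
    """Issue instructions one at a time as a chat; check only the last response."""
    messages: List[dict] = []
    response = ""
    for instruction in instructions:
        messages.append({"role": "user", "content": instruction})
        response = model_call(messages)  # model sees the full dialogue so far
        messages.append({"role": "assistant", "content": response})
    # Success of the whole instruction sequence is judged from the final response alone.
    return final_answer_check(response)


if __name__ == "__main__":
    # Toy example in the spirit of the text-modification task (hypothetical content).
    instructions = [
        "Start with the text: 'the cat sat on the mat'.",
        "Replace every occurrence of 'cat' with 'dog'.",
        "Uppercase the entire text and output only the result.",
    ]

    def fake_model(messages: List[dict]) -> str:
        # Stand-in for a real LLM API call; returns the correct final string here.
        return "THE DOG SAT ON THE MAT"

    ok = run_sequential_task(
        fake_model,
        instructions,
        lambda out: out.strip() == "THE DOG SAT ON THE MAT",
    )
    print("sequence followed correctly:", ok)
```

Because each later instruction depends on the result of the earlier ones, a correct final answer implies the intermediate steps were followed, which is what makes the single final check sufficient.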

