The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
June 28, 2024
Authors: Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
cs.AI
Abstract
Following multiple instructions is a crucial ability for large language
models (LLMs). Evaluating this ability comes with significant challenges: (i)
limited coherence between multiple instructions, (ii) positional bias where the
order of instructions affects model performance, and (iii) a lack of
objectively verifiable tasks. To address these issues, we introduce a benchmark
designed to evaluate models' abilities to follow multiple instructions through
sequential instruction following (SIFo) tasks. In SIFo, the successful
completion of multiple instructions is verifiable by examining only the final
instruction. Our benchmark evaluates instruction following using four tasks
(text modification, question answering, mathematics, and security rule
following), each assessing different aspects of sequential instruction
following. Our evaluation of popular LLMs, both closed-source and open-source,
shows that more recent and larger models significantly outperform their older
and smaller counterparts on the SIFo tasks, validating the benchmark's
effectiveness. All models struggle with following sequences of instructions,
hinting at an important lack of robustness of today's language models.
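To illustrate the final-instruction verification idea described above, here is a minimal sketch (a hypothetical example, not the benchmark's actual data or code): in a text-modification task, each instruction transforms the running text, so comparing only the final output against a reference implicitly checks that every earlier instruction was followed.

```python
# Minimal sketch of SIFo-style sequential verification (hypothetical
# example): a chain of text-modification instructions where correctness
# of the final output implies all earlier steps were followed.

def apply_instructions(text, instructions):
    # Apply each instruction (a function str -> str) in order.
    for step in instructions:
        text = step(text)
    return text

# Hypothetical instruction sequence for a text-modification task.
instructions = [
    lambda t: t.replace("cat", "dog"),  # 1. replace "cat" with "dog"
    lambda t: t.upper(),                # 2. uppercase the whole text
    lambda t: t + "!",                  # 3. append an exclamation mark
]

# Reference answer obtained by executing all instructions faithfully.
reference = apply_instructions("the cat sat", instructions)
print(reference)  # THE DOG SAT!

def verify(model_output, reference):
    # Only the final output is inspected, per SIFo's design: any skipped
    # or misapplied earlier instruction changes the final string.
    return model_output == reference
```

A model that skips step 1, for example, would produce "THE CAT SAT!" and fail verification even though the last two instructions were applied correctly, which is how checking only the final instruction still exercises the whole sequence.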