The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

Abstract

Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability poses three main challenges: (i) limited coherence between multiple instructions, (ii) positional bias, where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark that evaluates models' ability to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark covers four tasks (text modification, question answering, mathematics, and security rule following), each assessing a different aspect of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. However, all models struggle to follow sequences of instructions, which points to an important lack of robustness in today's language models.