The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

Abstract

Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias, where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness in today's language models.
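The idea that a chain of instructions can be checked through its final step alone can be illustrated with a toy sketch. This is not the benchmark's data or scoring code: the helper names (`apply_in_sequence`, `is_correct`) and the three-step text-modification chain are hypothetical, chosen only to show that when each instruction operates on the result of the previous one, a single comparison against the final expected output covers every earlier step.

```python
from typing import Callable, List

# An instruction is modeled here as a function from text to text (an assumption
# made for illustration; the benchmark itself phrases instructions in natural language).
Instruction = Callable[[str], str]


def apply_in_sequence(text: str, instructions: List[Instruction]) -> str:
    """Apply each instruction to the output of the previous one."""
    for instruction in instructions:
        text = instruction(text)
    return text


# Hypothetical chain: every step only has an effect if the step before it was carried out.
instructions: List[Instruction] = [
    lambda s: s.replace("cat", "dog"),      # step 1: swap the animal
    lambda s: s.replace("dog", "big dog"),  # step 2: needs "dog" from step 1
    lambda s: s.replace("big", "small"),    # step 3: needs "big" from step 2
]

source = "the cat sat"
expected_final = apply_in_sequence(source, instructions)


def is_correct(model_final_answer: str) -> bool:
    """Check only the final answer; earlier steps are verified implicitly."""
    return model_final_answer.strip() == expected_final


# A response that skipped step 2 cannot reach the expected final text.
skipped_step_two = apply_in_sequence(source, [instructions[0], instructions[2]])
```

In this sketch, skipping or mis-executing any intermediate step yields a different final string, so the single check in `is_correct` is enough to detect the failure.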
