PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Abstract

Recent evaluations of Large Language Models (LLMs) have centered on testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which evaluates whether an LLM completes an instruction based on the prediction file rather than the labeled API sequence, so it supports the varied API sequences that LLMs generate. We evaluate 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving just 6% session accuracy. We find three main causes of error in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
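
The gap between per-turn accuracy and session accuracy can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that a session counts as completed only when every one of its turns is completed. The function names and the toy outcomes below are hypothetical and are not taken from the PPTC code or data.

```python
from typing import List


def turn_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of individual turns completed, pooled over all sessions."""
    turns = [ok for session in sessions for ok in session]
    return sum(turns) / len(turns)


def session_accuracy(sessions: List[List[bool]]) -> float:
    """Fraction of sessions whose turns were all completed (assumed definition)."""
    return sum(all(session) for session in sessions) / len(sessions)


# Hypothetical outcomes: True means the instruction in that turn was completed.
toy_sessions = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, False, True],
    [True, True, True, False],
]

print(turn_accuracy(toy_sessions))     # 13 of 16 turns -> 0.8125
print(session_accuracy(toy_sessions))  # 1 of 4 sessions -> 0.25
```

Under this assumed definition, a single failed turn is enough to fail its whole session, so session accuracy can be far lower than turn accuracy even when most turns succeed.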
