PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Abstract

Recent evaluations of Large Language Models (LLMs) have centered on their zero-shot and few-shot capabilities for basic natural language tasks and on their ability to translate instructions into tool API calls. However, no evaluation has yet examined how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which judges whether an LLM has completed an instruction by examining the prediction file rather than comparing against the label API sequence, and therefore supports the variety of API sequences that LLMs generate. We evaluate 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving only 6% session accuracy. We identify three main causes of error in our benchmark: error accumulation across the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
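
To illustrate why evaluating the prediction file can accept API sequences that differ from the label sequence, the following is a minimal sketch, not the PPTX-Match implementation. It assumes a toy dictionary-based deck and three hypothetical API names (`add_slide`, `set_title`, `add_textbox`) that are invented here for illustration and are not taken from the benchmark.

```python
from copy import deepcopy


def apply_calls(deck, calls):
    """Apply a list of (name, args) calls to a toy deck and return the result.

    The API names here are hypothetical; they only stand in for real PPT APIs.
    """
    result = deepcopy(deck)
    for name, args in calls:
        if name == "add_slide":
            result["slides"].append({"title": "", "textboxes": []})
        elif name == "set_title":
            result["slides"][args["slide"]]["title"] = args["text"]
        elif name == "add_textbox":
            result["slides"][args["slide"]]["textboxes"].append(args["text"])
        else:
            raise ValueError(f"unknown call: {name}")
    return result


def sequence_match(pred_calls, label_calls):
    """Strict check: the predicted calls must equal the label calls exactly."""
    return pred_calls == label_calls


def file_match(pred_calls, label_calls, start):
    """Outcome check: compare the decks produced by each call sequence."""
    return apply_calls(start, pred_calls) == apply_calls(start, label_calls)


empty_deck = {"slides": []}
label_calls = [
    ("add_slide", {}),
    ("set_title", {"slide": 0, "text": "Q3 Review"}),
    ("add_textbox", {"slide": 0, "text": "Revenue grew"}),
]
# Same operations in a different order: a different sequence, the same deck.
pred_calls = [label_calls[0], label_calls[2], label_calls[1]]
# A prediction that sets the wrong title and omits the text box.
wrong_calls = [
    ("add_slide", {}),
    ("set_title", {"slide": 0, "text": "Q2 Review"}),
]

print(sequence_match(pred_calls, label_calls))          # False
print(file_match(pred_calls, label_calls, empty_deck))  # True
print(file_match(wrong_calls, label_calls, empty_deck)) # False
```

In this sketch the reordered prediction fails the strict sequence comparison but passes the outcome comparison, while the incorrect prediction fails both. A real evaluator would have to compare attributes of actual PPTX files, which this toy model does not attempt.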
