PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

Abstract

Recent evaluations of Large Language Models (LLMs) have centered on testing their zero-shot and few-shot capabilities on basic natural language tasks and their ability to translate instructions into tool APIs. However, how well LLMs use complex tools to complete multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark, which assesses LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System, which judges whether an LLM has completed an instruction by examining the prediction file rather than by comparing against the label API sequence; it therefore supports the varied API sequences that LLMs generate. We measure three closed-source LLMs and six open-source LLMs. The results show that GPT-4 outperforms the other LLMs with 75.1% accuracy in single-turn dialogue testing but struggles to complete entire sessions, achieving just 6% session accuracy. We identify three main causes of error in our benchmark: error accumulation in multi-turn sessions, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at https://github.com/gydpku/PPTC.
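
The gap between single-turn accuracy and session accuracy can be illustrated with a minimal sketch. It assumes that a session counts as correct only when every one of its turns is judged correct; the benchmark's actual scoring protocol may differ in its details. The function names and the toy outcomes below are hypothetical and are not taken from the PPTC code base.

```python
from typing import Dict, List


def turn_accuracy(sessions: Dict[str, List[bool]]) -> float:
    """Fraction of individual turns judged correct, pooled over all sessions."""
    turns = [ok for results in sessions.values() for ok in results]
    return sum(turns) / len(turns) if turns else 0.0


def session_accuracy(sessions: Dict[str, List[bool]]) -> float:
    """Fraction of sessions in which every turn was judged correct (assumed definition)."""
    if not sessions:
        return 0.0
    return sum(all(results) for results in sessions.values()) / len(sessions)


# Hypothetical per-turn outcomes for four sessions (illustrative data only).
toy = {
    "session_a": [True, True, True, True],
    "session_b": [True, False, True, True],
    "session_c": [True, True, True, False],
    "session_d": [False, True, True, True],
}

print(turn_accuracy(toy))     # 13 of 16 turns correct -> 0.8125
print(session_accuracy(toy))  # 1 of 4 sessions fully correct -> 0.25
```

Under this assumed definition, a single failed turn invalidates its whole session, so a model with high per-turn accuracy can still score low at the session level.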
