
InstructExcel: A Benchmark for Natural Language Instruction in Excel

October 23, 2023
Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
cs.AI

Abstract

With the evolution of Large Language Models (LLMs), we can solve increasingly complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so, we introduce a new large-scale benchmark, InstructExcel, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that InstructExcel is a hard benchmark for state-of-the-art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can all help improve performance on this benchmark.
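To make the task setup concrete, the sketch below illustrates few-shot dynamic prompting for instruction-to-OfficeScript generation. The instruction/OfficeScript pairs are hypothetical examples written for illustration (not drawn from InstructExcel), and the naive token-overlap similarity is a stand-in for whatever retrieval method selects in-context examples in practice; the assembled prompt would then be sent to a model such as GPT-4.

```typescript
// A hypothetical pool of instruction → OfficeScript pairs (illustrative only,
// not taken from the InstructExcel benchmark).
interface Example {
  instruction: string;
  officeScript: string;
}

const pool: Example[] = [
  {
    instruction: "Bold the header row",
    officeScript:
      "function main(workbook: ExcelScript.Workbook) {\n" +
      '  workbook.getActiveWorksheet().getRange("1:1").getFormat().getFont().setBold(true);\n' +
      "}",
  },
  {
    instruction: "Fill column A with yellow",
    officeScript:
      "function main(workbook: ExcelScript.Workbook) {\n" +
      '  workbook.getActiveWorksheet().getRange("A:A").getFormat().getFill().setColor("yellow");\n' +
      "}",
  },
];

// Naive token-overlap similarity; a stand-in for a real retrieval method.
function overlap(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  let n = 0;
  for (const t of ta) if (tb.has(t)) n++;
  return n;
}

// Dynamic prompting: pick the k examples most similar to the user's
// instruction, then append the query for the model to complete.
function buildPrompt(query: string, k: number): string {
  const picked = [...pool]
    .sort((x, y) => overlap(query, y.instruction) - overlap(query, x.instruction))
    .slice(0, k);
  const shots = picked
    .map((e) => `Instruction: ${e.instruction}\nOfficeScript:\n${e.officeScript}`)
    .join("\n\n");
  return `${shots}\n\nInstruction: ${query}\nOfficeScript:\n`;
}
```

Selecting examples per query (rather than using a fixed few-shot set) is what the paper calls dynamic prompting; it tends to surface in-context examples that use the same OfficeScript operations the query needs.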