

InstructExcel: A Benchmark for Natural Language Instruction in Excel

October 23, 2023
Authors: Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri
cs.AI

Abstract

With the evolution of Large Language Models (LLMs), we can solve increasingly complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so, we introduce a new large-scale benchmark, InstructExcel, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that InstructExcel is a hard benchmark for state-of-the-art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can help improve performance on this benchmark.
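To make the task concrete, here is a hedged sketch (not an example drawn from the InstructExcel data) of the kind of OfficeScript an LLM might be asked to produce for an instruction like "Bold row 1 of the active sheet". Real Office Scripts take an `ExcelScript.Workbook` and run inside Excel; the small mock classes below are illustrative stand-ins for that API so the snippet is self-contained, and only `main` mirrors the structure of an actual generated script.

```typescript
// Minimal mocks standing in for the ExcelScript API surface used below.
// These are illustrative assumptions, not part of Office Scripts itself.
class MockFont {
  bold = false;
  setBold(value: boolean): void { this.bold = value; }
}
class MockFormat {
  private font = new MockFont();
  getFont(): MockFont { return this.font; }
}
class MockRange {
  private format = new MockFormat();
  getFormat(): MockFormat { return this.format; }
}
class MockWorksheet {
  private ranges = new Map<string, MockRange>();
  getRange(address: string): MockRange {
    if (!this.ranges.has(address)) this.ranges.set(address, new MockRange());
    return this.ranges.get(address)!;
  }
}
class MockWorkbook {
  private sheet = new MockWorksheet();
  getActiveWorksheet(): MockWorksheet { return this.sheet; }
}

// The "generated" script: same entry-point shape as a real OfficeScript,
// which would declare `workbook: ExcelScript.Workbook` instead of the mock.
function main(workbook: MockWorkbook): void {
  const sheet = workbook.getActiveWorksheet();
  // Apply bold formatting to the entire first row.
  sheet.getRange("1:1").getFormat().getFont().setBold(true);
}

const wb = new MockWorkbook();
main(wb);
console.log(wb.getActiveWorksheet().getRange("1:1").getFormat().getFont().bold);
```

In the benchmark setting, the model receives the natural-language instruction (plus spreadsheet context) and must emit the body of `main`; the 'Automate' feature supplies the reference script recorded from the user's actual actions.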