InstructExcel: A Benchmark for Natural Language Instruction in Excel

Abstract

With the evolution of Large Language Models (LLMs), we can solve increasingly complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so, we introduce a new large-scale benchmark, InstructExcel, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that InstructExcel is a hard benchmark for state-of-the-art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can help improve performance on this benchmark.
