A Benchmark and Framework for Evaluating Next-Action Prediction in Spreadsheets

Abstract

Predictive code completion substantially speeds up developers' work. Spreadsheets are far more widely used, yet such auto-completion features are virtually non-existent in them. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges arise: (1) public spreadsheet corpora lack edit histories, and (2) the space of spreadsheet actions is complex (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that requests a prediction after each user action, accepts or rejects that prediction, updates the remaining future actions upon acceptance, and repeats until the target spreadsheet is obtained. We evaluate multiple baseline predictors (zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze what the benchmark reveals about them, including but not limited to: properties of saved actions and false positives, efficiency, the effect of user profiles, the effect of triggers, and the effect of context.
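
The online evaluation described in the abstract can be illustrated with a minimal Python sketch. This is not the benchmark's actual implementation. The action encoding (a cell address paired with its new content), the acceptance rule (a prediction is accepted only if every predicted action is still pending), and the names `evaluate_online` and `predict_next_in_series` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical action encoding: (cell address, new cell content).
Action = Tuple[str, str]
Predictor = Callable[[Sequence[Action]], List[Action]]


def evaluate_online(actions: Sequence[Action], predictor: Predictor) -> Dict[str, int]:
    """Replay a recorded action sequence and query the predictor after each user action."""
    remaining = list(actions)   # future user actions still needed to reach the target
    history: List[Action] = []  # actions applied so far (by the user or via predictions)
    user_steps = accepted = rejected = saved = 0

    while remaining:
        # The simulated user performs the next pending action.
        history.append(remaining.pop(0))
        user_steps += 1

        prediction = predictor(history)
        if not prediction:
            continue  # the predictor chose not to suggest anything

        # Assumed acceptance rule: every predicted action must still be pending.
        pending = list(remaining)
        try:
            for action in prediction:
                pending.remove(action)
        except ValueError:
            rejected += 1  # counted as a false positive in this sketch
            continue

        # Accepted: the predicted actions no longer have to be performed by the user.
        remaining = pending
        history.extend(prediction)
        accepted += 1
        saved += len(prediction)

    return {"user_steps": user_steps, "accepted": accepted,
            "rejected": rejected, "saved": saved}


def predict_next_in_series(history: Sequence[Action]) -> List[Action]:
    """Toy predictor: after an integer entry, suggest the next integer one row below.

    Assumes single-letter column names such as "A1"; purely illustrative.
    """
    cell, value = history[-1]
    try:
        row, number = int(cell[1:]), int(value)
    except ValueError:
        return []
    return [(f"{cell[0]}{row + 1}", str(number + 1))]


if __name__ == "__main__":
    series = [("A1", "1"), ("A2", "2"), ("A3", "3"), ("A4", "4")]
    print(evaluate_online(series, predict_next_in_series))
```

In this sketch, the number of saved actions and the number of rejected predictions stand in for the "saved actions" and "false positives" mentioned in the abstract. A real evaluation would also need to model triggers, user profiles, and context, which are omitted here.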