A Benchmark and Framework for Evaluating Next-Action Prediction in Spreadsheets

Abstract

Predictive code completion greatly accelerates developers' work. In spreadsheets, which are far more widely used, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges arise: (1) public spreadsheet corpora lack edit histories, and (2) the space of spreadsheet actions is complex (spatial, temporal, composite). To address (1), we manually curate 52 action sequences (12K actions in total) that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that requests a prediction after each user action, accepts or rejects that prediction, updates the remaining future actions upon acceptance, and repeats until the target spreadsheet is obtained. We evaluate multiple baseline predictors (including zero-shot LLMs, fine-tuned small language models (SLMs), and classical models) and analyze what the benchmark reveals, including properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, and context.
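The online evaluation loop described above can be illustrated with a minimal sketch. This is not the benchmark's implementation. The action representation (a cell address paired with a value), the acceptance rule (accept only if every predicted action is still pending), and the names `evaluate_online` and `predict_next_number` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]   # assumed form: (cell address, value), e.g. ("A1", "1")
Sheet = Dict[str, str]
Predictor = Callable[[List[Action], Sheet], Optional[List[Action]]]


def evaluate_online(actions: List[Action], predictor: Predictor):
    """Replay one recorded sequence, querying the predictor after each user action."""
    remaining = list(actions)          # future user actions still to be performed
    history: List[Action] = []
    sheet: Sheet = {}
    stats = {"user_actions": 0, "saved_actions": 0, "false_positives": 0}

    while remaining:
        # The simulated user performs the next recorded action.
        cell, value = remaining.pop(0)
        sheet[cell] = value
        history.append((cell, value))
        stats["user_actions"] += 1

        prediction = predictor(history, sheet)
        if not prediction:
            continue                   # the predictor abstained
        prediction = list(dict.fromkeys(prediction))  # drop duplicate actions

        # Assumed acceptance rule: every predicted action must still be pending.
        if all(p in remaining for p in prediction):
            for p in prediction:
                remaining.remove(p)    # update the future actions
                sheet[p[0]] = p[1]
                history.append(p)
            stats["saved_actions"] += len(prediction)
        else:
            stats["false_positives"] += 1
    return stats, sheet


def predict_next_number(history: List[Action], sheet: Sheet) -> Optional[List[Action]]:
    """Toy baseline (hypothetical): continue an integer series down a single-letter column."""
    if len(history) < 2:
        return None
    (c1, v1), (c2, v2) = history[-2], history[-1]
    if c1[0] != c2[0] or not (v1.isdigit() and v2.isdigit()):
        return None
    r1, r2 = int(c1[1:]), int(c2[1:])
    if r2 != r1 + 1 or int(v2) != int(v1) + 1:
        return None
    return [(f"{c2[0]}{r2 + 1}", str(int(v2) + 1))]


if __name__ == "__main__":
    recorded = [("A1", "1"), ("A2", "2"), ("A3", "3"), ("A4", "4"), ("B1", "x")]
    print(evaluate_online(recorded, predict_next_number))
```

On this toy sequence the predictor saves one action (A3), produces one false positive (it proposes A5 after A4, which the user never enters), and the final sheet matches the target. Removing accepted actions from `remaining` is what makes the evaluation online: later predictions are judged against what the user would still have to do, not against the original fixed sequence.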