A Benchmark and Framework for Evaluating Next-Action Prediction in Spreadsheets

Abstract

Predictive code completion greatly accelerates developers' work. Although spreadsheets are far more widely used, such auto-completion features are virtually non-existent there. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges arise: (1) public spreadsheet corpora lack edit histories, and (2) the space of spreadsheet actions is complex (spatial, temporal, composite). To address (1), we manually curate 52 sequences comprising 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that expects a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats until the target spreadsheet is obtained. We evaluate multiple baseline predictors (including zero-shot LLMs, fine-tuned small language models (SLMs), and classical models) and analyze what the benchmark reveals, including but not limited to: properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, and context.
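To make the online evaluation loop concrete, the following is a minimal Python sketch written under several assumptions; it is not the benchmark's actual implementation. An action is modeled as a hypothetical (cell, value) pair, a prediction is accepted only if every predicted action still appears among the remaining future actions, and the metric names (`saved_actions`, `false_positives`, `user_actions`) are illustrative. The abstract does not specify the real acceptance rule or action encoding.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical action encoding: (cell address, value written to that cell).
Action = Tuple[str, str]
# A predictor maps the observed history to a (possibly empty) list of actions.
Predictor = Callable[[Sequence[Action]], List[Action]]


def try_accept(remaining: List[Action], predicted: List[Action]) -> Optional[List[Action]]:
    """Return the remaining actions after removing the prediction,
    or None if any predicted action is not a pending future action.
    (Assumed acceptance rule, chosen only for illustration.)"""
    rest = list(remaining)
    for action in predicted:
        if action not in rest:
            return None
        rest.remove(action)
    return rest


def evaluate_online(target: Sequence[Action], predict: Predictor) -> dict:
    """Replay a recorded action sequence, querying the predictor after
    each user action until the target spreadsheet is reached."""
    history: List[Action] = []
    remaining: List[Action] = list(target)
    saved = 0
    false_positives = 0
    while remaining:
        history.append(remaining.pop(0))  # the user performs the next action
        if not remaining:
            break  # target spreadsheet obtained; nothing left to predict
        predicted = predict(history)
        if not predicted:
            continue  # the predictor chose to stay silent
        rest = try_accept(remaining, predicted)
        if rest is None:
            false_positives += 1  # rejected prediction
        else:
            history.extend(predicted)  # accepted: applied on the user's behalf
            saved += len(predicted)
            remaining = rest  # update the future actions
    return {
        "user_actions": len(target) - saved,
        "saved_actions": saved,
        "false_positives": false_positives,
    }


def toy_predict(history: Sequence[Action]) -> List[Action]:
    """Toy predictor: one wrong guess, then one correct multi-action guess."""
    if len(history) == 1:
        return [("Z9", "oops")]
    if len(history) == 2:
        return [("A3", "3"), ("A4", "4")]
    return []


target = [("A1", "1"), ("A2", "2"), ("A3", "3"), ("A4", "4")]
result = evaluate_online(target, toy_predict)
print(result)
```

In this toy run the first prediction is rejected and the second is accepted, so the simulated user performs two of the four actions by hand. A stricter acceptance rule (for example, requiring the prediction to match the immediately next actions in order) would be an equally plausible reading of the abstract.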