A Benchmark and Framework for Evaluating Next-Action Prediction in Spreadsheets

Abstract

Predictive code completion greatly speeds up developers' work. Spreadsheets are far more widely used, yet comparable auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges arise: (1) public spreadsheet corpora contain no edit histories, and (2) the space of spreadsheet actions is complex (spatial, temporal, composite). To address (1), we manually curate 52 sequences, totaling 12K actions, that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that requests a prediction after each user action, accepts or rejects that prediction, updates the future actions upon acceptance, and repeats until the target spreadsheet is obtained. We evaluate multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze what the benchmark reveals, including but not limited to: properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, and context.
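The online evaluation loop described above can be illustrated with a minimal sketch. It is not the benchmark's implementation: the action model (a cell address paired with a written value), the acceptance rule (accept only if every predicted action is still pending), the names `evaluate_online` and `fill_down_predictor`, and the reported counters are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed, simplified action model: (cell address, value written).
Action = Tuple[str, str]
Predictor = Callable[[List[Action]], List[Action]]


def evaluate_online(actions: List[Action], predict: Predictor) -> Dict[str, int]:
    """Replay a ground-truth sequence, querying the predictor after each user action."""
    remaining = list(actions)      # future actions the user still has to perform
    history: List[Action] = []     # everything applied to the sheet so far
    stats = {"user_actions": 0, "saved_actions": 0, "accepted": 0, "rejected": 0}
    while remaining:
        # The user performs the next ground-truth action.
        history.append(remaining.pop(0))
        stats["user_actions"] += 1
        if not remaining:
            break                  # target spreadsheet obtained
        predicted = list(dict.fromkeys(predict(history)))  # drop duplicates, keep order
        if not predicted:
            continue               # the predictor chose not to trigger
        # Assumed acceptance rule: every predicted action must still be pending.
        if all(p in remaining for p in predicted):
            for p in predicted:
                remaining.remove(p)   # update the future actions upon acceptance
            history.extend(predicted)
            stats["saved_actions"] += len(predicted)
            stats["accepted"] += 1
        else:
            stats["rejected"] += 1    # counted here as a false positive
    return stats


def fill_down_predictor(history: List[Action]) -> List[Action]:
    """Toy baseline: continue an integer series being typed down a single column."""
    if len(history) < 2:
        return []
    (c1, v1), (c2, v2) = history[-2], history[-1]
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", c1)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", c2)
    if not (m1 and m2) or m1.group(1) != m2.group(1):
        return []
    if int(m2.group(2)) != int(m1.group(2)) + 1:
        return []
    if not (v1.isdigit() and v2.isdigit()) or int(v2) != int(v1) + 1:
        return []
    return [(f"{m2.group(1)}{int(m2.group(2)) + 1}", str(int(v2) + 1))]


if __name__ == "__main__":
    demo = [("A1", "1"), ("A2", "2"), ("A3", "3"), ("A4", "4"), ("A5", "10"), ("B1", "x")]
    print(evaluate_online(demo, fill_down_predictor))
```

In this sketch, an accepted prediction removes the matching actions from the pending list, so the number of actions the user performs plus the number of saved actions always equals the length of the original sequence. A rejected prediction leaves the pending list unchanged and is tallied separately.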