Spreadsheet-RL: Improving Large Language Model Agents on Real-World Spreadsheet Tasks via Reinforcement Learning

Abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows promise on simple spreadsheet operations, it struggles with the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially improves AI agents' performance on both general and domain-specific spreadsheet tasks: it raises the Pass@1 of Qwen3-4B-Thinking-2507 on SpreadsheetBench from 12.0% to 23.4%, and on our curated Domain-Spreadsheet dataset from 8.4% to 17.2%. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.