Spreadsheet-RL: Improving Large Language Model Agents on Real-World Spreadsheet Tasks via Reinforcement Learning

Abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building AI-driven spreadsheet agents has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting of general-purpose LLMs; while this design shows promise on simple spreadsheet operations, it struggles with the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes Spreadsheet Gym, an environment designed for multi-turn RL: it exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Comprehensive experiments show that Spreadsheet-RL substantially improves AI agents' performance on both general and domain-specific spreadsheet tasks: it raises Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and on our curated Domain-Spreadsheet dataset from 8.4% to 17.2%. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interaction with data interfaces in everyday work.
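
To put the reported Pass@1 numbers in context, the following is a minimal sketch of how such a metric and the resulting gains could be computed. The abstract does not define Pass@1, so the definition used here (the mean per-task fraction of attempts that pass) is an assumption, and the helper names `pass_at_1` and `gains` are hypothetical; only the four percentages come from the abstract.

```python
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-task success rate (assumed definition of Pass@1).

    results[i][j] is True if attempt j on task i passed.
    Every task is assumed to have at least one attempt.
    """
    if not results:
        raise ValueError("no tasks")
    per_task = [sum(attempts) / len(attempts) for attempts in results]
    return sum(per_task) / len(per_task)


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain as a fraction)."""
    return after - before, (after - before) / before


# Pass@1 values in percent, as reported in the abstract (before, after).
reported = {
    "SpreadsheetBench": (12.0, 23.4),
    "Domain-Spreadsheet": (8.4, 17.2),
}

for name, (before, after) in reported.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain:.1f} points ({rel_gain:.0%} relative)")
```

On the reported figures this works out to a gain of 11.4 points (about 95% relative) on SpreadsheetBench and 8.8 points (about 105% relative) on Domain-Spreadsheet, so Pass@1 roughly doubles on both benchmarks.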