Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs. This design shows promise for simple spreadsheet operations, but it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes Spreadsheet Gym, an environment designed for multi-turn RL. Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox and provides a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially improves AI agents' performance on both general and domain-specific spreadsheet tasks: it raises Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and its Pass@1 on our curated Domain-Spreadsheet dataset from 8.4% to 17.2%. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interaction with data interfaces in everyday work.
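To put the reported Pass@1 figures in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative improvement factor for each benchmark. It uses only the four numbers quoted in the abstract; the names `results`, `gains`, and `summary` are illustrative and do not come from the paper.

```python
# Pass@1 figures quoted in the abstract, in percent: (before RL, after RL).
results = {
    "SpreadsheetBench": (12.0, 23.4),
    "Domain-Spreadsheet": (8.4, 17.2),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative improvement factor)."""
    return after - before, after / before


# Hypothetical helper structure: benchmark name -> (absolute gain, factor).
summary = {name: gains(before, after) for name, (before, after) in results.items()}

for name, (abs_gain, factor) in summary.items():
    print(f"{name}: +{abs_gain:.1f} points, {factor:.2f}x the baseline")
```

On both benchmarks the fine-tuned model roughly doubles the baseline Pass@1: a gain of about 11.4 points (about 1.95x) on SpreadsheetBench and about 8.8 points (about 2.05x) on Domain-Spreadsheet.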