Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
December 15, 2025
Authors: Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng
cs.AI
Abstract
We introduce Finch, a finance & accounting benchmark for evaluating AI agents on real-world, enterprise-grade professional workflows -- workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management.
We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, and (2) meticulous expert annotation of each workflow, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows comprising 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work.
We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max: GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of the workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
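As a minimal sketch of the workflow-level metric reported above, the snippet below assumes -- our assumption, not something stated in the abstract -- that a composite workflow counts as passed only when every one of its constituent tasks passes. All names and data structures are illustrative, not Finch's actual evaluation harness.

    # Hypothetical sketch of a workflow-level pass rate, assuming a composite
    # workflow passes only if all of its constituent tasks pass. Names and
    # structures are illustrative, not Finch's actual harness.
    from dataclasses import dataclass

    @dataclass
    class Workflow:
        name: str
        task_results: list[bool]  # one pass/fail flag per task

    def workflow_pass_rate(workflows: list[Workflow]) -> float:
        """Fraction of workflows whose tasks all passed."""
        if not workflows:
            return 0.0
        passed = sum(all(w.task_results) for w in workflows)
        return passed / len(workflows)

    # Toy example: only the first of three composite workflows passes end-to-end.
    demo = [
        Workflow("budgeting", [True, True]),
        Workflow("trading", [True, False, True]),
        Workflow("asset-management", [False]),
    ]
    print(f"pass rate: {workflow_pass_rate(demo):.1%}")  # pass rate: 33.3%

Under this all-tasks-must-pass aggregation, a single failed task sinks the whole workflow, which is one plausible reading of why workflow-level pass rates (38.4% and 25.0% above) can be low even when many individual tasks succeed.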