Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
December 15, 2025
Authors: Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng
cs.AI
Abstract
We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management.
We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation of the workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work.
We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of the workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.