Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Abstract

We introduce Finch, a finance and accounting benchmark for evaluating AI agents on real-world, enterprise-grade professional workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions. It preserves in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spans diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation of workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
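
To put the headline figures in proportion, the short sketch below derives a few ratios from the numbers quoted in the abstract. It is illustrative only: the constant and function names are ours, "27 million cells" is treated as a round figure, and the implied pass counts assume that each pass rate is computed over all 172 workflows, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
WORKFLOWS = 172
TASKS = 384
SPREADSHEETS = 1_710
CELLS = 27_000_000  # "27 million cells", treated here as a round figure

# Workflow pass rates quoted in the abstract.
PASS_RATES = {"GPT 5.1 Pro": 0.384, "Claude Sonnet 4.5": 0.250}


def implied_passes(rate: float, total: int = WORKFLOWS) -> int:
    """Nearest whole number of workflows implied by a pass rate.

    Assumption (not stated in the abstract): the rate is measured
    over all `total` workflows.
    """
    return round(rate * total)


# Average scale of the benchmark, derived from the quoted totals.
tasks_per_workflow = TASKS / WORKFLOWS
cells_per_spreadsheet = CELLS / SPREADSHEETS

# Approximate number of passed workflows per system under the assumption above.
implied = {name: implied_passes(rate) for name, rate in PASS_RATES.items()}

if __name__ == "__main__":
    print(f"tasks per workflow:    {tasks_per_workflow:.2f}")
    print(f"cells per spreadsheet: {cells_per_spreadsheet:,.0f}")
    for name, count in implied.items():
        print(f"{name}: about {count} of {WORKFLOWS} workflows")
```

Under that assumption, the quoted rates correspond to roughly 66 and 43 of the 172 workflows respectively, with an average of a little over two tasks per workflow.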
