Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Abstract

We introduce Finch, a finance and accounting benchmark for evaluating AI agents on real-world, enterprise-grade professional workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions. It preserves in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spans diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation of the workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
