Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

Abstract

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text. Existing benchmarks tackle only individual pieces of this problem (e.g., translating natural-language questions into SQL queries, or answering questions over small tables provided in context) and do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries across 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are published at github.com/ucbepic/DataAgentBench.