CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

Abstract

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development environments. Such environments typically involve both complex code and large-scale data (i.e., file systems). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap between what they measure and real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), in which agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, and each task environment contains an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even the top-performing systems struggle to integrate data discovery with code execution effectively, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.