CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

Abstract

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Real development environments typically involve both complex code and large-scale data (i.e., file systems). However, existing benchmarks mostly evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), in which agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, and each task environment contains 980 files on average, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even the top-performing system struggles to integrate data discovery with code execution effectively, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.