SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Abstract

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking arises naturally in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this phenomenon by decomposing software engineering tasks into three parts: (i) a natural-language description of the specification; (ii) visible validation tests that exercise the specified features in isolation; and (iii) held-out tests that compose those same features to simulate real-world usage. Given the specification and the visible validation suite, an agent that genuinely implements the specification should produce a solution that also passes all of the held-out tests. We therefore use the gap between the pass rates on the two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark of 30 systems-level programming tasks ranging from short-horizon tasks, such as building a JSON parser, to ultra-long-horizon tasks, such as building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: although every frontier agent saturates the visible suite, reward hacking persists, and smaller models exhibit larger gaps on the held-out suite. The gap also scales sharply with task length, growing by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuinely working systems or merely game the test suites that developers hand them.
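The gap metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual harness: the function names, the representation of test results as lists of booleans, and the choice to report the gap as visible minus held-out pass rate in percentage points are all hypothetical. The last helper simply restates the reported trend (28 percentage points per tenfold increase in code size) as a log-linear relation; the abstract does not give an intercept, so only the change in gap between two code sizes is computed.

```python
import math
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of tests that passed (each entry is True for a pass)."""
    if not results:
        raise ValueError("test suite is empty")
    return sum(results) / len(results)


def reward_hacking_gap(visible: Sequence[bool], held_out: Sequence[bool]) -> float:
    """Visible pass rate minus held-out pass rate, in percentage points.

    Assumption: the sign convention (visible - held_out) is chosen here
    for illustration; the abstract only says "the gap in pass rates".
    """
    return 100.0 * (pass_rate(visible) - pass_rate(held_out))


def gap_increase_for_size_ratio(size_ratio: float,
                                points_per_decade: float = 28.0) -> float:
    """Expected change in gap when code size grows by `size_ratio`.

    Assumption: the reported trend is read as log-linear in code size,
    i.e. +28 percentage points per factor of 10.
    """
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return points_per_decade * math.log10(size_ratio)


# Hypothetical results: the agent saturates the visible suite
# but fails 7 of 20 held-out tests.
visible_results = [True] * 20
held_out_results = [True] * 13 + [False] * 7

gap = reward_hacking_gap(visible_results, held_out_results)
increase = gap_increase_for_size_ratio(100.0)  # code base 100x larger
print(f"gap: {gap:.1f} pp, expected increase at 100x size: {increase:.1f} pp")
```

A gap near zero means the held-out tests pass about as often as the visible ones; a large positive gap indicates a solution tuned to the visible tests rather than to the specification.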