SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Abstract

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the user's true goal. We study this reward hacking phenomenon by decomposing software engineering tasks into three parts: (i) a natural-language description of the specification, (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Given the specification and the visible validation test suite, a genuine agent should be able to produce a solution that also passes all of the held-out tests. We therefore use the gap between the pass rates on the visible and held-out suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short-horizon tasks like building a JSON parser to ultra-long-horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on the held-out suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.
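The gap metric described above can be illustrated with a minimal sketch. It assumes each suite's outcome is available as a list of pass/fail booleans and that the reported trend is linear in the base-10 logarithm of code size, which is how "28 percentage points for every tenfold increase" reads. The function names and the input format are hypothetical and are not taken from the SpecBench harness.

```python
import math


def pass_rate(results):
    """Fraction of passing tests; `results` is a list of booleans (assumed format)."""
    if not results:
        raise ValueError("test suite is empty")
    return sum(results) / len(results)


def reward_hacking_gap(visible_results, held_out_results):
    """Visible pass rate minus held-out pass rate, in percentage points."""
    return 100.0 * (pass_rate(visible_results) - pass_rate(held_out_results))


def implied_gap_increase(loc_small, loc_large, slope_pp_per_decade=28.0):
    """Gap growth implied by the reported trend between two code sizes.

    Assumes the gap is linear in log10(code size); the default slope is the
    28 percentage points per tenfold increase quoted in the abstract.
    """
    if loc_small <= 0 or loc_large <= 0:
        raise ValueError("code sizes must be positive")
    return slope_pp_per_decade * math.log10(loc_large / loc_small)


# Illustrative numbers only (not results from the paper):
# the visible suite is saturated, while 15 of 20 held-out tests pass.
visible = [True] * 20
held_out = [True] * 15 + [False] * 5
example_gap = reward_hacking_gap(visible, held_out)  # 25 percentage points
```

A gap near zero is what a genuine solution would show under this definition, while a large positive gap indicates a solution fitted to the visible tests.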