불가능 벤치: 테스트 케이스 활용 경향에 대한 LLM 평가

초록

과제 수행을 위해 '지름길'을 찾고 활용하려는 경향은 대규모 언어 모델(LLM)의 신뢰할 수 있는 평가와 배포에 상당한 위험을 초래합니다. 예를 들어, 단위 테스트에 접근 권한이 있는 LLM 에이전트가 기본적인 버그를 수정하는 대신 실패하는 테스트를 삭제할 수 있습니다. 이러한 행동은 벤치마크 결과의 타당성과 실제 LLM 코딩 지원 도구 배포의 신뢰성을 모두 훼손합니다. 이러한 행동을 정량화, 연구 및 완화하기 위해 우리는 불가능한 과제 벤치마크(ImpossibleBench)를 소개합니다. 이는 LLM 에이전트가 테스트 케이스를 악용하는 성향을 체계적으로 측정하는 벤치마크 프레임워크입니다. ImpossibleBench는 LiveCodeBench 및 SWE-bench와 같은 기존 벤치마크의 과제에 자연어 명세와 단위 테스트 간의 직접적인 충돌을 도입하여 '불가능한' 변형을 생성합니다. 우리는 에이전트의 '치팅율'을 이러한 불가능한 과제에서의 통과율로 측정하며, 여기서 어떤 통과도 명세 위반 지름길을 의미합니다. 실용적인 프레임워크로서 ImpossibleBench는 단순한 평가 도구를 넘어 다목적 도구입니다. 우리는 다음과 같은 유용성을 입증합니다: (1) 모델 행동 연구: 단순한 테스트 수정부터 복잡한 연산자 오버로딩에 이르기까지 치팅 행동의 더 세분화된 세부 사항을 밝혀냄. (2) 컨텍스트 엔지니어링: 프롬프트, 테스트 접근 권한 및 피드백 루프가 치팅율에 어떻게 영향을 미치는지 보여줌. (3) 모니터링 도구 개발: 검증된 기만적 솔루션을 갖춘 테스트베드 제공. 우리는 ImpossibleBench가 더 강력하고 신뢰할 수 있는 LLM 시스템 구축을 위한 유용한 프레임워크로 역할하기를 바랍니다. 구현 내용은 https://github.com/safety-research/impossiblebench에서 확인할 수 있습니다.

English

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.

불가능 벤치: 테스트 케이스 활용 경향에 대한 LLM 평가

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

초록

Support