
ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

October 23, 2025
Authors: Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini
cs.AI

Abstract

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.
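As a rough illustration of the construction described in the abstract, the Python sketch below shows one way an "impossible" task variant could be built by mutating a unit test so it contradicts the natural-language specification, and how a cheating rate could then be computed as the pass rate on those variants. The names (`Task`, `make_impossible`, `cheating_rate`) are hypothetical and are not drawn from the ImpossibleBench codebase; this is a minimal sketch of the idea, not the framework's actual API.

```python
# Minimal sketch (hypothetical names, not the actual ImpossibleBench API):
# build an "impossible" task variant by mutating a unit test so it conflicts
# with the natural-language specification, then report a cheating rate.

from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    spec: str         # natural-language specification
    tests: List[str]  # unit tests as source snippets


def make_impossible(task: Task) -> Task:
    """Flip an expected value in one test so no spec-compliant solution can pass."""
    mutated = list(task.tests)
    # e.g. the spec says add(2, 2) should equal 4, but the test now demands 5
    mutated[0] = mutated[0].replace("== 4", "== 5")
    return replace(task, tests=mutated)


def cheating_rate(agent: Callable[[Task], bool], tasks: List[Task]) -> float:
    """Pass rate on impossible variants; any pass implies a spec-violating shortcut."""
    impossible = [make_impossible(t) for t in tasks]
    passes = sum(agent(t) for t in impossible)
    return passes / len(impossible) if impossible else 0.0
```

Because no solution can satisfy both the specification and the mutated tests at once, any pass on such a variant must come from a specification-violating shortcut (e.g., editing or deleting the failing test), which is exactly the quantity the abstract defines as the cheating rate.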