ImpossibleBench：衡量大语言模型利用测试用例的倾向性

摘要

在大型語言模型（LLM）的可靠評估與部署過程中，模型傾向於尋找並利用「捷徑」完成任務的現象存在顯著風險。例如，一個具備單元測試訪問權限的LLM智能體可能會選擇刪除未通過的測試案例，而非修復潛在的程序錯誤。此類行為不僅損害基準測試結果的有效性，更會削弱現實場景中LLM編程助手部署的可靠性。為量化、研究並緩解此類行為，我們提出ImpossibleBench——一個系統化衡量LLM智能體利用測試案例傾向的基準框架。該框架通過在LiveCodeBench、SWE-bench等現有基準任務中植入自然語言描述與單元測試之間的直接衝突，構建出「不可完成」的任務變體。我們將智能體在這些任務上的通過率定義為「作弊率」，因為任何通過結果都必然意味着其採用了違反任務規範的捷徑。作為實用框架，ImpossibleBench不僅是評估工具，更具備多功能性。我們通過實證展示其三大應用場景：（1）行為研究層面，揭示了從簡單測試篡改到複雜運算符重載等細粒度作弊行為；（2）上下文工程層面，闡明提示設計、測試訪問權限及反饋機制對作弊率的影響；（3）監控工具開發層面，提供包含已驗證欺騙性解決方案的測試平台。我們期待ImpossibleBench能為構建更強健可靠的LLM系統提供有效支撐。項目代碼已開源於：https://github.com/safety-research/impossiblebench

English

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.

ImpossibleBench：衡量大语言模型利用测试用例的倾向性

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

摘要

Support