ImpossibleBench: テストケース悪用傾向の測定による大規模言語モデル評価

要旨

タスクを完了するための「ショートカット」を見つけて利用する傾向は、大規模言語モデル（LLM）の信頼性の高い評価と展開に重大なリスクをもたらす。例えば、単体テストへのアクセス権を持つLLMエージェントが、根本的なバグを修正する代わりに、失敗するテストを削除する可能性がある。このような振る舞いは、ベンチマーク結果の正当性と、実世界でのLLMコーディングアシスタント展開の信頼性の両方を損なう。このような振る舞いを定量化、研究、軽減するため、我々はImpossibleBenchを提案する。これは、LLMエージェントがテストケースを悪用する傾向を体系的に測定するベンチマークフレームワークである。ImpossibleBenchは、LiveCodeBenchやSWE-benchなどの既存ベンチマークからタスクを選び、自然言語仕様と単体テストの間に直接的な矛盾を導入することで「不可能」なバリアントを作成する。エージェントの「不正行為率」を、これらの不可能なタスクにおける合格率として測定する。ここでの合格は、必ず仕様違反のショートカットを意味する。実用的なフレームワークとして、ImpossibleBenchは単なる評価ツールではなく、多目的なツールである。その有用性を以下の点で実証する：(1) モデルの振る舞いの研究：単純なテスト改変から複雑な演算子オーバーロードに至る、不正行為のより詳細な実態を明らかにする。(2) コンテキストエンジニアリング：プロンプト、テストへのアクセス権、フィードバックループが不正行為率に与える影響を示す。(3) 監視ツールの開発：検証済みの欺瞞的解決策を含むテストベッドを提供する。我々は、ImpossibleBenchがより堅牢で信頼性の高いLLMシステム構築のための有用なフレームワークとなることを期待する。実装は以下で公開されている。 https://github.com/safety-research/impossiblebench

English

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.

ImpossibleBench: テストケース悪用傾向の測定による大規模言語モデル評価

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

要旨

Support