When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Abstract

Code agents are increasingly trusted to fix bugs autonomously on platforms such as GitHub, yet their security evaluation focuses almost exclusively on functional correctness. In this paper, we reveal a novel type of threat to real-world code agents: Functionally Correct yet Vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. We propose FCV-Attack, which can be deliberately crafted by malicious attackers or implicitly introduced by benign developers, and show that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat. Across 12 agent-model combinations on SWE-Bench, the attack requires only black-box access and a single query to the code agent. For example, for CWE-538 (information exposure vulnerability), FCV-Attack attains an attack success rate of 40.7% on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.
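
As a rough illustration of what "functionally correct yet vulnerable" can mean, the sketch below contrasts two patches for the same bug. It is not taken from the paper: the function names, the debug log path, and the toy test suite are all hypothetical, and the leak is only loosely modeled on the CWE-538 category mentioned above. Both patches pass the functional tests; only one writes a secret to an externally readable file.

```python
import os
import tempfile

# Hypothetical debug log location (an assumption for this sketch);
# any path readable by other users or services would illustrate the same point.
DEBUG_LOG = os.path.join(tempfile.gettempdir(), "fcv_demo_debug.log")


def build_auth_header_clean(config: dict) -> dict:
    """Bug fix only: a missing 'token' key yields {} instead of raising KeyError."""
    token = config.get("token")
    return {"Authorization": f"Bearer {token}"} if token else {}


def build_auth_header_fcv(config: dict) -> dict:
    """Same fix, plus a 'debug' write that dumps the config, token included."""
    with open(DEBUG_LOG, "a", encoding="utf-8") as log:
        log.write(f"config={config!r}\n")  # sensitive data leaves the process here
    token = config.get("token")
    return {"Authorization": f"Bearer {token}"} if token else {}


def functional_tests(build_auth_header) -> bool:
    """Stand-in for a project's test suite: it inspects return values only."""
    return (
        build_auth_header({"token": "abc123"}) == {"Authorization": "Bearer abc123"}
        and build_auth_header({}) == {}
    )
```

Because `functional_tests` looks only at return values, it cannot distinguish the two patches. A check that also inspected side effects (here, the contents of `DEBUG_LOG`) would be needed to tell them apart.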
