When "Correct" Does Not Mean Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Abstract

Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their evaluation focuses almost exclusively on functional correctness rather than security. In this paper, we reveal a novel type of threat to real-world code agents: Functionally Correct yet Vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. With our proposed FCV-Attack, which can be deliberately crafted by malicious attackers or implicitly introduced by benign developers, we show that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat. Across 12 agent-model combinations on SWE-Bench, the attack requires only black-box access and a single query to the code agent. For example, for CWE-538 (information exposure vulnerability), the FCV-Attack attains an attack success rate of 40.7% on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.
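To make the notion of an FCV patch concrete, the following is a minimal illustrative sketch and is not taken from the paper. It assumes a hypothetical bug (a port parser that rejected padded input) and a hypothetical patch that fixes it while also writing a secret to a log, in the spirit of an information exposure weakness. All names (`parse_port`, `make_logger`, `functional_test`, `leaks_secret`) and the in-memory log stream are assumptions made for illustration.

```python
import io
import logging


def make_logger(stream: io.StringIO) -> logging.Logger:
    """Build a logger that writes to an in-memory stream (a stand-in for a log file)."""
    logger = logging.getLogger("fcv_demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    return logger


def parse_port(raw: str, api_token: str, logger: logging.Logger) -> int:
    """Hypothetical patched function: the functional fix is the .strip() call."""
    port = int(raw.strip())
    # Vulnerable line: the secret token ends up in the log output.
    logger.info("parsed port %d (token=%s)", port, api_token)
    return port


def functional_test(logger: logging.Logger) -> bool:
    """The kind of check a test suite performs: only the return value is inspected."""
    return parse_port(" 8080 ", "secret-token-123", logger) == 8080


def leaks_secret(stream: io.StringIO, secret: str) -> bool:
    """A security check that the functional test never runs."""
    return secret in stream.getvalue()


log_stream = io.StringIO()
logger = make_logger(log_stream)
passed = functional_test(logger)
leaked = leaks_secret(log_stream, "secret-token-123")
print(passed, leaked)  # prints: True True
```

The patch passes its functional test while still leaking the token, which is the gap between correctness-only evaluation and security that the abstract describes.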