Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This limits their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To close this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state of the art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at https://github.com/dongdongunique/EvoSynth.
