Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This limits their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To close this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop that lets it iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state of the art, achieving an 85.5% Attack Success Rate (ASR) against highly robust models such as Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.
