Evolve the Method, Not the Prompt: Evolutionary Synthesis of Jailbreak Attacks on Large Language Models

Abstract

Automated red-teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This constraint limits their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To close this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop that lets it iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state of the art, achieving an 85.5% Attack Success Rate (ASR) against highly robust models such as Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.