Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Abstract

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This limits their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To close this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop that lets it iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state of the art, achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: https://github.com/dongdongunique/EvoSynth.