MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

Abstract

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs): it involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose Multi-round Automatic Red-Teaming (MART), a method that incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing the scalability of red-teaming and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interact iteratively: the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety-aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment falls by up to 84.7% after 4 rounds of MART, achieving performance comparable to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout the iterations, indicating that the target LLM maintains strong instruction-following performance.
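The iterative interplay described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under strong assumptions, not the paper's implementation: `ToyAdversary`, `ToyTarget`, `run_rounds`, the bracketed "trick" strings, and the one-trick-per-round `budget` are all hypothetical stand-ins for real LLMs, safety classifiers, and fine-tuning. It shows only the control flow: attack, measure the violation rate, safety-tune the target, then update the adversary.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAdversary:
    """Hypothetical stand-in for the adversarial LLM."""
    tricks: list

    def generate(self, seeds):
        # One attack prompt per (seed, trick) pair.
        return [f"{s} [{t}]" for s in seeds for t in self.tricks]

    def update(self, successful):
        # Keep only tricks that appeared in a successful attack, if any did.
        working = [t for t in self.tricks
                   if any(f"[{t}]" in p for p in successful)]
        if working:
            self.tricks = working


@dataclass
class ToyTarget:
    """Hypothetical stand-in for the target LLM."""
    blocked: set = field(default_factory=set)

    def respond(self, prompt):
        refuse = any(f"[{t}]" in prompt for t in self.blocked)
        return "REFUSE" if refuse else "COMPLY"

    def safety_finetune(self, successful, budget=1):
        # Toy "fine-tuning": learn to refuse the trick used by up to
        # `budget` successful attacks (assumed limit, for illustration).
        for p in successful[:budget]:
            self.blocked.add(p[p.rindex("[") + 1:-1])


def run_rounds(adversary, target, seeds, rounds=4):
    """Return the per-round violation rate of the toy target."""
    history = []
    for _ in range(rounds):
        attacks = adversary.generate(seeds)
        successful = [p for p in attacks if target.respond(p) == "COMPLY"]
        history.append(len(successful) / len(attacks))
        target.safety_finetune(successful)   # target improves
        adversary.update(successful)         # adversary adapts
    return history


history = run_rounds(
    ToyAdversary(tricks=["roleplay", "hypothetical", "typos"]),
    ToyTarget(),
    seeds=["seed A", "seed B"],
)
print(history)
```

In this toy setting the violation rate can only stay flat or fall from round to round, because the target blocks one more trick each round while the adversary can only reuse tricks that still work. A real adversarial LLM would also generate new attacks, which is why the abstract describes both models improving in each round.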
