

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

May 1, 2026
作者: Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim
cs.AI

Abstract

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important in red-teaming, but achieving both is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, the unstable rewards encountered in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function Z estimation in GFNs and thereby reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of GFN. We demonstrate the superior attack performance and diversity of S-GFN across various settings.
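The standard GFlowNet trajectory balance (TB) loss requires a learned estimate of the partition function log Z. One way to see how pairwise comparison can remove it: subtract the TB residuals of two sampled trajectories, and log Z cancels. The sketch below is an illustrative toy under assumptions of our own (a uniform backward policy so log P_B drops out, and a simple squared pairwise residual); it is not the paper's actual loss or implementation:

```python
def tb_residual(log_pf, log_reward, log_z):
    # Standard trajectory-balance residual for one trajectory:
    # log Z + log P_F(tau) - log R(x)
    # (assumes a uniform backward policy, so the log P_B term is dropped)
    return log_z + log_pf - log_reward

def contrastive_tb_loss(log_pf_a, log_reward_a, log_pf_b, log_reward_b):
    # Pairwise form: the difference of two TB residuals. The shared
    # log Z term cancels, so no partition-function estimate is needed.
    delta = (log_pf_a - log_reward_a) - (log_pf_b - log_reward_b)
    return delta ** 2

# The cancellation: residual_a - residual_b is independent of log_z.
for log_z in (0.0, 3.0, -5.0):
    diff = tb_residual(-1.2, 0.5, log_z) - tb_residual(-2.0, 1.1, log_z)
    print(round(diff, 6))  # same value (1.4) for every choice of log_z
```

At the TB optimum every residual is zero, so every pairwise difference is also zero; matching pairwise differences therefore preserves the same optimal policy while removing log Z from the set of learned quantities.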