Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
May 1, 2026
Authors: Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim
cs.AI
Abstract
Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important in red-teaming, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, the unstable reward signals encountered in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates the estimation of the partition function Z in GFNs and thereby reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer that prevents the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of the GFN. We demonstrate the strong attack performance and diversity of S-GFN across various settings.
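To illustrate the pairwise idea mentioned in the abstract, the sketch below shows how the log-partition term log Z cancels when the standard trajectory-balance residual of one trajectory is subtracted from another's. This is a minimal toy illustration of the general contrastive mechanism only; the function names are hypothetical and the paper's actual loss, masking, and fluency terms may differ.

```python
def tb_residual(log_pf_sum, log_reward):
    # delta(tau) = log P_F(tau) - log R(x).
    # The standard trajectory-balance loss is (log Z + delta(tau))^2,
    # which requires a learned estimate of log Z.
    return log_pf_sum - log_reward

def pairwise_tb_loss(traj_a, traj_b):
    # Contrastive form: ((log Z + delta_a) - (log Z + delta_b))^2.
    # log Z is shared by both trajectories, so it cancels in the
    # difference and no partition-function estimate is needed.
    delta_a = tb_residual(*traj_a)
    delta_b = tb_residual(*traj_b)
    return (delta_a - delta_b) ** 2

# Toy check: two trajectories whose residuals already agree give zero loss.
# Each tuple is (sum of log P_F over the trajectory, log reward).
a = (-3.0, -1.0)
b = (-5.0, -3.0)
print(pairwise_tb_loss(a, b))  # -> 0.0
```

Because the loss depends only on residual differences, noisy absolute reward scales perturb both terms of a pair similarly, which is one intuition for the improved stability the abstract claims.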