Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
May 1, 2026
Authors: Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim
cs.AI
Abstract
Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important in red-teaming, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, the unstable reward signals encountered in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates the estimation of the partition function Z in GFNs and thereby reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer that prevents the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of the GFN. We demonstrate the strong attack performance and diversity of S-GFN across various settings.
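To illustrate the pairwise idea mentioned in the abstract, the sketch below shows how the log-partition term log Z cancels when the standard trajectory-balance residual of one trajectory is subtracted from another's. This is a minimal toy illustration of the general contrastive mechanism only; the function names are hypothetical and the paper's actual loss, masking, and fluency terms may differ.

```python
def tb_residual(log_pf_sum, log_reward):
    # delta(tau) = log P_F(tau) - log R(x).
    # The standard trajectory-balance loss is (log Z + delta(tau))^2,
    # which requires a learned estimate of log Z.
    return log_pf_sum - log_reward

def pairwise_tb_loss(traj_a, traj_b):
    # Contrastive form: ((log Z + delta_a) - (log Z + delta_b))^2.
    # log Z is shared by both trajectories, so it cancels in the
    # difference and no partition-function estimate is needed.
    delta_a = tb_residual(*traj_a)
    delta_b = tb_residual(*traj_b)
    return (delta_a - delta_b) ** 2

# Toy check: two trajectories whose residuals already agree give zero loss.
# Each tuple is (sum of log P_F over the trajectory, log reward).
a = (-3.0, -1.0)
b = (-5.0, -3.0)
print(pairwise_tb_loss(a, b))  # -> 0.0
```

Because the loss depends only on residual differences, noisy absolute reward scales perturb both terms of a pair similarly, which is one intuition for the improved stability the abstract claims.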