Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Abstract

Large Language Model (LLM) red-teaming, which proactively identifies vulnerabilities in LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important in red-teaming, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates estimation of the partition function Z in GFNs and thereby reduces training instability. S-GFN avoids Z estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer that prevents the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while preserving the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.
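The abstract does not state the loss itself, so the following is only a minimal sketch of how a pairwise comparison can remove the partition function, assuming the standard trajectory-balance residual log Z + log P_F(tau) - log R(x) for autoregressive text (where the backward policy is trivial). Subtracting the residuals of two sampled sequences cancels log Z. The function names are hypothetical, and the masking scheme and fluency stabilizer mentioned in the abstract are not modeled here.

```python
def tb_residual(log_z, log_pf, log_reward):
    """Standard trajectory-balance residual for one sequence.

    Assumes autoregressive generation, so the backward policy contributes
    nothing. log_z is the learned log-partition estimate.
    """
    return log_z + log_pf - log_reward


def pairwise_residual(log_pf_a, log_r_a, log_pf_b, log_r_b):
    """Difference of two trajectory-balance residuals.

    log_z appears in both residuals with the same sign, so it cancels and
    never has to be estimated.
    """
    return (log_pf_a - log_r_a) - (log_pf_b - log_r_b)


def contrastive_loss(log_pfs, log_rewards):
    """Mean squared pairwise residual over all unordered pairs in a batch.

    Illustrative only: this is an assumed form, not the paper's stated loss.
    """
    n = len(log_pfs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            r = pairwise_residual(
                log_pfs[i], log_rewards[i], log_pfs[j], log_rewards[j]
            )
            total += r * r
            pairs += 1
    return total / pairs if pairs else 0.0
```

Under these assumptions the loss is zero whenever log P_F(tau) - log R(x) is the same constant for every sample, that is, whenever the policy samples in proportion to the reward. This is the same optimum that trajectory balance targets, and it is consistent with the abstract's claim that the optimal policy of GFN is preserved.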
