

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

January 30, 2026
Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
cs.AI

Abstract

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
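One natural closed form consistent with the abstract's Beta-Bernoulli model: if a prompt's per-sample success probability is p ~ Beta(alpha, beta), then the probability that at least one of N independent samples succeeds is 1 - B(alpha, beta + N) / B(alpha, beta), which can be extrapolated to large N once (alpha, beta) are fit from a small budget. The Python sketch below illustrates this under that assumption; the Beta-Binomial fitting routine, the function names, and the example counts are illustrative placeholders, not the paper's anchored SABER estimator.

import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

def asr_at_n(alpha, beta, n):
    # P(at least one of n i.i.d. samples succeeds), marginalized over p ~ Beta(alpha, beta):
    # 1 - E[(1 - p)^n] = 1 - B(alpha, beta + n) / B(alpha, beta), computed in log space.
    return 1.0 - np.exp(betaln(alpha, beta + n) - betaln(alpha, beta))

def fit_beta_from_counts(successes, trials):
    # Fit (alpha, beta) by maximum likelihood on per-prompt Beta-Binomial counts
    # gathered with a small budget (e.g., n = 100 samples per prompt).
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)

    def neg_log_lik(log_params):
        a, b = np.exp(log_params)
        # Beta-Binomial log-likelihood, dropping the constant binomial coefficient.
        ll = betaln(successes + a, trials - successes + b) - betaln(a, b)
        return -ll.sum()

    res = minimize(neg_log_lik, x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x)

# Illustrative (not real) small-budget measurements: 5 prompts probed 100 times each.
k = [3, 0, 1, 0, 7]          # observed jailbreak successes per prompt
n_small = [100] * 5          # trials per prompt
alpha_hat, beta_hat = fit_beta_from_counts(k, n_small)
print("extrapolated ASR@1000:", asr_at_n(alpha_hat, beta_hat, 1000))

Because the tail of (1 - p)^N is dominated by prompts with small but nonzero p, this kind of extrapolation can rise sharply with N even when the observed small-budget success rate looks low, which is the nonlinear risk amplification the abstract describes.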