
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

January 30, 2026
Authors: Mingqian Feng, Xiaodong Liu, Weiwei Yang, Chenliang Xu, Christopher White, Jianfeng Gao
cs.AI

Abstract

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a scaling-aware Best-of-N estimation of risk, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities with a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk-scaling profiles and show that models that appear robust under standard evaluation can experience rapid, nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
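The key quantity here is ASR@N, the probability that at least one of N sampled attempts jailbreaks the model. If the per-sample success probability p varies as p ~ Beta(α, β), the expected rate has a closed form: E[1 - (1-p)^N] = 1 - B(α, β+N) / B(α, β), which can be fit at a small budget and then evaluated at a large N. The sketch below illustrates this generic Beta-Bernoulli extrapolation on synthetic data; it is not the paper's anchored SABER estimator (whose details are not given in the abstract), and the function names and method-of-moments fit are illustrative assumptions.

```python
import numpy as np
from scipy.special import betaln

def fit_beta_mom(success_rates, eps=1e-6):
    """Fit Beta(alpha, beta) to observed success rates via method of moments."""
    r = np.clip(np.asarray(success_rates, dtype=float), eps, 1.0 - eps)
    m, v = r.mean(), max(r.var(), eps)   # guard against zero variance
    common = m * (1.0 - m) / v - 1.0     # MoM estimate of alpha + beta
    return m * common, (1.0 - m) * common

def expected_asr_at_n(alpha, beta, n):
    """E[1 - (1-p)^n] for p ~ Beta(alpha, beta):
    closed form 1 - B(alpha, beta + n) / B(alpha, beta), computed in log space."""
    return 1.0 - np.exp(betaln(alpha, beta + n) - betaln(alpha, beta))

# Synthetic stand-in for small-budget measurements (n = 100 samples per prompt).
rng = np.random.default_rng(0)
true_p = rng.beta(0.05, 5.0, size=200)        # heterogeneous per-prompt risk
observed = rng.binomial(100, true_p) / 100.0  # empirical per-prompt success rates

a, b = fit_beta_mom(observed)
for n in (1, 100, 1000):
    print(f"extrapolated ASR@{n}: {expected_asr_at_n(a, b, n):.3f}")
```

In practice one would replace the synthetic `observed` array with empirical per-prompt success rates from a small sampling budget; the closed form then predicts attack success at budgets far beyond what was measured.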