
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

May 7, 2026
Authors: Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, Michael A. Riegler
cs.AI

Abstract

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component (η² ≈ 0.52), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
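The two headline statistics in the abstract, AUROC for the safe-versus-abliterated contrast and η² for the share of variance explained by target identity, can be illustrated with a minimal sketch. This is not the SimpleAudit implementation: the function names `auroc` and `eta_squared` are hypothetical, the severity scores are made up for illustration, and the one-way η² shown here is a simplification that may differ from the variance decomposition used in the paper (which also accounts for auditor and judge effects).

```python
from statistics import mean


def auroc(neg_scores, pos_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def eta_squared(groups):
    """One-way eta squared: between-group sum of squares over total."""
    all_values = [v for g in groups.values() for v in g]
    grand = mean(all_values)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (mean(g) - grand) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


# Made-up severity scores (higher = more severe), NOT data from the paper.
safe = [0.0, 1.0, 0.0, 1.0, 2.0]
abliterated = [3.0, 4.0, 2.0, 4.0, 3.0]

print("AUROC:", auroc(safe, abliterated))
print("eta^2:", eta_squared({"safe": safe, "abliterated": abliterated}))
```

An AUROC near 1.0 means the instrument almost always scores the abliterated target as more severe than the safe one, and a large η² means that most of the score variance is attributable to which target was audited rather than to other sources.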