
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

December 24, 2025
Authors: Jiashuo Liu, Jiayun Wu, Chunjie Wu, Jingkai Liu, Zaiyuan Wang, Huan Zhou, Wenhao Huang, Hongseok Namkoong
cs.AI

Abstract

The rapid proliferation of Large Language Models (LLMs) and of diverse specialized benchmarks calls for a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that aggregates performance across multiple ability dimensions. Current evaluation methods, which rely primarily on static scoring, are fundamentally limited: they struggle to determine the proper mixing ratio across diverse benchmarks and, critically, they fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest in which models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss records. Monte Carlo simulation (N = 100,000 iterations) is used to approximate a statistically robust Expected Win Score (E[S_m]), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity (T_k), which allows us to profile models by their risk appetite, distinguishing robust generalists from aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
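
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify several details, so the following are assumptions made only for illustration: each round uses one benchmark, the model with the higher score on that benchmark wins the match, models are paired with neighbours after sorting by win count (ties broken at random), an unpaired model in an odd-sized field simply sits out, and after round k the T_k models with the fewest wins are dropped. The function names, the input format, and the example scores are hypothetical, and the default iteration count is far smaller than the paper's N = 100,000.

```python
import random


def swiss_tournament(scores, benchmarks, eliminate, rng):
    """Simulate one tournament and return the win count of every model.

    scores:     {model: {benchmark: score}} (hypothetical input format)
    benchmarks: ordered list, one benchmark per round
    eliminate:  list of T_k, the number of models dropped after round k
    """
    wins = {m: 0 for m in scores}
    active = list(scores)
    for bench, t_k in zip(benchmarks, eliminate):
        # Shuffle, then stable-sort by wins: models with similar records
        # become neighbours, and equal records are ordered at random.
        rng.shuffle(active)
        active.sort(key=lambda m: wins[m], reverse=True)
        # Pair neighbours. With an odd field the last model sits out
        # and receives no point (an assumption of this sketch).
        for a, b in zip(active[0::2], active[1::2]):
            sa, sb = scores[a][bench], scores[b][bench]
            if sa == sb:
                winner = rng.choice([a, b])
            else:
                winner = a if sa > sb else b
            wins[winner] += 1
        # Drop the t_k models with the fewest wins, random among ties,
        # always keeping at least one model in the field.
        if t_k > 0:
            rng.shuffle(active)
            active.sort(key=lambda m: wins[m], reverse=True)
            active = active[: max(len(active) - t_k, 1)]
    return wins


def expected_win_score(scores, benchmarks, eliminate, n_iter=2000, seed=0):
    """Monte Carlo estimate of E[S_m]: mean win count over n_iter tournaments."""
    rng = random.Random(seed)
    totals = {m: 0 for m in scores}
    for _ in range(n_iter):
        result = swiss_tournament(scores, benchmarks, eliminate, rng)
        for m, w in result.items():
            totals[m] += w
    return {m: totals[m] / n_iter for m in scores}


# Made-up scores for four hypothetical models on three benchmarks.
demo_scores = {
    "A": {"math": 0.9, "code": 0.9, "reasoning": 0.9},
    "B": {"math": 0.8, "code": 0.5, "reasoning": 0.7},
    "C": {"math": 0.5, "code": 0.8, "reasoning": 0.6},
    "D": {"math": 0.1, "code": 0.1, "reasoning": 0.1},
}
demo_rounds = ["math", "code", "reasoning"]

if __name__ == "__main__":
    # Compare a setting without elimination to one that drops a model
    # after each of the first two rounds.
    for t in ([0, 0, 0], [1, 1, 0]):
        print(t, expected_win_score(demo_scores, demo_rounds, t))
```

Varying the `eliminate` schedule and comparing the resulting estimates is one simple way to mimic the failure sensitivity analysis: under this sketch's rules, a model that loses early has fewer remaining rounds in which to score, so harsher schedules penalize uneven performers more.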