LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
December 24, 2025
Authors: Jiashuo Liu, Jiayun Wu, Chunjie Wu, Jingkai Liu, Zaiyuan Wang, Huan Zhou, Wenhao Huang, Hongseok Namkoong
cs.AI
Abstract
The rapid proliferation of Large Language Models (LLMs) and of diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that aggregates performance across multiple ability dimensions. Current evaluation methods, which rely primarily on static scoring, are fundamentally limited: they struggle to determine a principled mixing ratio across diverse benchmarks and, critically, fail to capture a model's dynamic competitive fitness, or its vulnerability, when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest in which models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss records. Monte Carlo simulation (N = 100,000 iterations) approximates a statistically robust Expected Win Score, E[S_m], eliminating the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity, T_k, which lets us profile models by risk appetite, distinguishing robust generalists from aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
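The mechanism the abstract describes (record-based Swiss pairing per benchmark, Monte Carlo estimation of E[S_m], and a T_k elimination parameter) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Bradley-Terry-style head-to-head outcome model, the function names `swiss_round_sim` and `expected_win_score`, and the toy strengths and benchmark names are all assumptions introduced here.

```python
import random
from collections import defaultdict

def swiss_round_sim(models, win_prob, benchmarks, rng, elim_per_round=0):
    """Simulate one Swiss-system contest: each round is one benchmark,
    and models with similar win-loss records are paired against each other."""
    wins = {m: 0 for m in models}
    alive = list(models)
    for bench in benchmarks:
        # Sort by accumulated wins (ties broken randomly) so similar records meet.
        order = sorted(alive, key=lambda m: (wins[m], rng.random()), reverse=True)
        for a, b in zip(order[0::2], order[1::2]):
            # Head-to-head outcome drawn from assumed per-benchmark strengths
            # via a Bradley-Terry-style win probability (an assumption).
            sa, sb = win_prob[bench][a], win_prob[bench][b]
            winner = a if rng.random() < sa / (sa + sb) else b
            wins[winner] += 1
        # Failure sensitivity knob: optionally drop the T_k weakest records.
        if elim_per_round and len(alive) - elim_per_round >= 2:
            alive = sorted(alive, key=lambda m: wins[m],
                           reverse=True)[: len(alive) - elim_per_round]
    return wins

def expected_win_score(models, win_prob, benchmarks,
                       n_iter=10_000, seed=0, elim_per_round=0):
    """Monte Carlo estimate of E[S_m]: mean final win count over many contests."""
    rng = random.Random(seed)
    total = defaultdict(float)
    for _ in range(n_iter):
        outcome = swiss_round_sim(models, win_prob, benchmarks, rng, elim_per_round)
        for m, w in outcome.items():
            total[m] += w
    return {m: total[m] / n_iter for m in models}

# Toy example: four models with fixed strengths reused across three benchmarks.
models = ["A", "B", "C", "D"]
strengths = {"A": 0.9, "B": 0.6, "C": 0.4, "D": 0.1}
benchmarks = ["math", "code", "reasoning"]
win_prob = {b: strengths for b in benchmarks}
scores = expected_win_score(models, win_prob, benchmarks, n_iter=2000)
```

Sweeping `elim_per_round` (the T_k of the abstract) over several values and comparing how each model's E[S_m] degrades is one way to separate robust generalists from specialists whose scores collapse once early losses become fatal.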