Brevity Constraints Reverse Performance Hierarchies in Language Models
March 11, 2026
Author: MD Azizul Hakim
cs.AI
Abstract
Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite having 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate that this reflects a correctable prompt-design flaw rather than a fundamental capability limitation. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models, directly inverting the original gaps. These reversals prove that large models possess superior latent capabilities that universal prompting masks. We validate these findings through three independent contamination tests and show that inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.
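The abstract page includes no code; as a minimal sketch of the kind of comparison it describes (a universal prompt versus a brevity-constrained prompt scored on the same benchmark items), one might set up an evaluation as follows. The prompt wording, the `toy_model` stub, the `evaluate` helper, and the exact-match scoring criterion are illustrative assumptions, not the authors' actual protocol.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates: a "universal" prompt and a brevity-constrained
# variant of the kind the abstract describes (wording is illustrative only).
STANDARD_PROMPT = "Answer the following question.\n\nQuestion: {question}\nAnswer:"
BRIEF_PROMPT = (
    "Answer the following question. Respond with only the final answer, "
    "with no explanation.\n\nQuestion: {question}\nAnswer:"
)

def evaluate(
    generate: Callable[[str], str],   # wraps any model or API call
    problems: List[Dict[str, str]],   # [{"question": ..., "answer": ...}, ...]
    template: str,
) -> float:
    """Return exact-match accuracy of `generate` under a given prompt template."""
    correct = 0
    for item in problems:
        prompt = template.format(question=item["question"])
        prediction = generate(prompt).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct += 1
    return correct / len(problems) if problems else 0.0

if __name__ == "__main__":
    # Toy stand-ins for a model call and a benchmark item, purely for illustration.
    def toy_model(prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "unknown"

    toy_problems = [{"question": "What is 2 + 2?", "answer": "4"}]
    for name, template in [("standard", STANDARD_PROMPT), ("brief", BRIEF_PROMPT)]:
        print(name, evaluate(toy_model, toy_problems, template))
```

In this framing, the paper's intervention amounts to swapping the prompt template per model scale while holding the benchmark and scoring fixed, which is what allows the accuracy gap to be attributed to the prompt rather than to model capability.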