REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
July 14, 2025
Authors: Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
cs.AI
Abstract
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models such as DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluation. Our analysis yields several key mechanistic insights: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
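
To make the "simultaneous testing" idea concrete, below is a minimal sketch of how a REST-style query could be assembled and scored: several benchmark problems are concatenated into one prompt and the model's single reply is split back into per-problem answers. This is an illustrative assumption about the setup, not the authors' released implementation; the prompt wording and the helper names `build_rest_prompt` and `split_answers` are hypothetical.

```python
# Illustrative sketch (assumption): concatenate k problems into one stress-test
# prompt, then naively parse the model's single reply into per-problem answers.
from typing import List


def build_rest_prompt(problems: List[str]) -> str:
    """Concatenate multiple problems into a single multi-problem prompt."""
    header = (
        "Solve each of the following problems. "
        "Label each solution as 'Answer i:' where i is the problem number.\n\n"
    )
    body = "\n\n".join(f"Problem {i + 1}: {p}" for i, p in enumerate(problems))
    return header + body


def split_answers(reply: str, num_problems: int) -> List[str]:
    """Extract per-problem answers from the model's single reply."""
    answers = []
    for i in range(1, num_problems + 1):
        start = reply.find(f"Answer {i}:")
        if start == -1:
            answers.append("")  # the model skipped or mislabeled this problem
            continue
        end = reply.find(f"Answer {i + 1}:", start)
        answers.append(
            reply[start + len(f"Answer {i}:"): end if end != -1 else None].strip()
        )
    return answers


if __name__ == "__main__":
    demo = ["Compute 17 * 24.", "What is the derivative of x**3?"]
    print(build_rest_prompt(demo))
```

Under this reading, accuracy under REST can be compared directly against single-question accuracy on the same problems, which is what exposes the degradation and the discrimination between otherwise near-ceiling models described in the abstract.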