REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
July 14, 2025
Authors: Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
cs.AI
Abstract
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models such as DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluation. Our analysis yields several key mechanistic insights: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
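
To make the "simultaneous testing" idea concrete, below is a minimal sketch of how a REST-style query could be assembled and scored: several benchmark problems are concatenated into one prompt and the model's single reply is split back into per-problem answers. This is an illustrative assumption about the setup, not the authors' released implementation; the prompt wording and the helper names `build_rest_prompt` and `split_answers` are hypothetical.

```python
# Illustrative sketch (assumption): concatenate k problems into one stress-test
# prompt, then naively parse the model's single reply into per-problem answers.
from typing import List


def build_rest_prompt(problems: List[str]) -> str:
    """Concatenate multiple problems into a single multi-problem prompt."""
    header = (
        "Solve each of the following problems. "
        "Label each solution as 'Answer i:' where i is the problem number.\n\n"
    )
    body = "\n\n".join(f"Problem {i + 1}: {p}" for i, p in enumerate(problems))
    return header + body


def split_answers(reply: str, num_problems: int) -> List[str]:
    """Extract per-problem answers from the model's single reply."""
    answers = []
    for i in range(1, num_problems + 1):
        start = reply.find(f"Answer {i}:")
        if start == -1:
            answers.append("")  # the model skipped or mislabeled this problem
            continue
        end = reply.find(f"Answer {i + 1}:", start)
        answers.append(
            reply[start + len(f"Answer {i}:"): end if end != -1 else None].strip()
        )
    return answers


if __name__ == "__main__":
    demo = ["Compute 17 * 24.", "What is the derivative of x**3?"]
    print(build_rest_prompt(demo))
```

Under this reading, accuracy under REST can be compared directly against single-question accuracy on the same problems, which is what exposes the degradation and the discrimination between otherwise near-ceiling models described in the abstract.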