REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Abstract

Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, which results in two critical limitations: (1) they are vulnerable to data contamination and are no longer challenging enough (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly and perpetual creation of new questions with substantial human effort; (2) they fail to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings. Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluation. Two key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique retain more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
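To make the idea of simultaneous testing concrete, the following is a minimal sketch of how several benchmark questions might be packed into one prompt and scored per question. The abstract does not specify the prompt template, the answer format, or the scoring rule, so the function names (`build_stress_prompt`, `extract_answers`, `stress_accuracy`), the `Q<n>`/`A<n>` labeling scheme, and exact-match scoring are all illustrative assumptions rather than the authors' implementation.

```python
import re
from typing import List, Optional


def build_stress_prompt(questions: List[str]) -> str:
    """Concatenate several questions into a single prompt.

    The template below is a hypothetical choice; the abstract does not
    describe the actual wording used by REST.
    """
    instruction = (
        "Answer every question below. "
        "Give each final answer on its own line as 'A<number>: <answer>'."
    )
    numbered = [f"Q{i}: {q}" for i, q in enumerate(questions, start=1)]
    return instruction + "\n\n" + "\n".join(numbered)


# Matches lines such as "A2: Paris" (assumed answer format, not from the paper).
ANSWER_LINE = re.compile(r"^A(\d+):[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_answers(response: str, n_questions: int) -> List[Optional[str]]:
    """Pull one answer per question out of the model response.

    Questions the model skipped or labeled incorrectly stay as None.
    """
    answers: List[Optional[str]] = [None] * n_questions
    for match in ANSWER_LINE.finditer(response):
        index = int(match.group(1)) - 1
        if 0 <= index < n_questions:
            answers[index] = match.group(2)
    return answers


def stress_accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of questions answered correctly (exact string match, assumed)."""
    correct = sum(
        p is not None and p.strip() == g.strip()
        for p, g in zip(predicted, gold)
    )
    return correct / len(gold)
```

Under these assumptions, comparing `stress_accuracy` for prompts that hold several questions against the same model's accuracy when each question is asked alone gives a rough view of the degradation the abstract describes.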
