Robust Reasoning Benchmark

Abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (average accuracy drops of up to 55% across perturbations, and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing them to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters, as well as Claude Opus 4.6, exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that, to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, which raises fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
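The sequential setting described above can be illustrated with a minimal sketch. The abstract does not specify the prompt template or the grading procedure, so everything here is an assumption made for illustration: the function names `build_sequential_prompt` and `accuracy_by_position`, the wording of the instruction header, and the representation of graded results as lists of booleans are all hypothetical. The sketch only shows how several unperturbed problems might be packed into one context and how accuracy could be broken down by problem position to make any decay visible.

```python
from collections import defaultdict
from typing import Dict, Sequence


def build_sequential_prompt(problems: Sequence[str]) -> str:
    """Pack several unperturbed problems into one prompt, to be solved in order.

    The header wording is a hypothetical placeholder, not the paper's template.
    """
    header = "Solve the following problems in order. Give a final answer for each.\n\n"
    body = "\n\n".join(
        f"Problem {i}: {text}" for i, text in enumerate(problems, start=1)
    )
    return header + body


def accuracy_by_position(runs: Sequence[Sequence[bool]]) -> Dict[int, float]:
    """Return accuracy per 1-based problem position across many runs.

    Each run is a list of booleans: True if the problem at that position
    was answered correctly (grading is assumed to happen elsewhere).
    """
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for run in runs:
        for pos, ok in enumerate(run, start=1):
            total[pos] += 1
            correct[pos] += int(ok)
    return {pos: correct[pos] / total[pos] for pos in sorted(total)}
```

Under these assumptions, a falling sequence of values in the dictionary returned by `accuracy_by_position` would correspond to the accuracy decay on subsequent problems that the abstract reports; comparing it against single-problem accuracy would be one way to separate that effect from baseline difficulty.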
