MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although several recent benchmarks evaluate the long-context capabilities of LLMs, none targets their mathematical reasoning over long contexts, an ability that is crucial for applying LLMs in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks such as Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay requires both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
