
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

October 7, 2024
Authors: Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo
cs.AI

Abstract

Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we introduce MathHay, an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. Unlike previous benchmarks such as Needle in a Haystack, which focus primarily on information retrieval within long texts, MathHay requires models to possess both information-seeking and complex mathematical reasoning abilities. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing LLMs. Even the best-performing model, Gemini-1.5-Pro-002, still struggles with mathematical reasoning over long contexts, achieving only 51.26% accuracy at 128K tokens. This highlights the significant room for improvement on the MathHay benchmark.
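The abstract describes MathHay only at this level of detail. As a rough illustration of the haystack-style setup it alludes to (not the authors' actual construction pipeline), a long-context math item can be assembled by embedding the question-relevant "needle" documents among irrelevant filler until the context reaches a target token budget, then asking a question whose answer requires combining numbers from the needles. The helper name build_long_context_example, the words-to-tokens ratio, and the Acme revenue figures below are hypothetical, introduced purely for this sketch.

```python
import random


def build_long_context_example(relevant_docs, distractor_docs,
                               target_tokens=128_000, tokens_per_word=1.3,
                               seed=0):
    """Assemble a haystack-style long-context example.

    The relevant documents (the "needles") are mixed in among distractor
    documents until the combined text reaches roughly the target token
    budget, approximated here via a simple words-to-tokens ratio.
    """
    rng = random.Random(seed)
    docs = list(relevant_docs)
    budget_words = int(target_tokens / tokens_per_word)
    word_count = sum(len(d.split()) for d in docs)

    # Pad with shuffled distractors until the word budget is reached.
    pool = list(distractor_docs)
    rng.shuffle(pool)
    for d in pool:
        if word_count >= budget_words:
            break
        docs.append(d)
        word_count += len(d.split())

    # Shuffle so the needles end up at arbitrary positions in the haystack.
    rng.shuffle(docs)
    return "\n\n".join(docs)


# Hypothetical two-needle example: answering requires locating both figures
# in the long context and performing a simple calculation on them.
needles = [
    "Q3 revenue for Acme was $120 million.",
    "Q4 revenue for Acme was $150 million.",
]
distractors = ["An unrelated filler paragraph about a different topic."] * 10_000
context = build_long_context_example(needles, distractors, target_tokens=2_000)
question = ("By how many millions of dollars did Acme's revenue grow "
            "from Q3 to Q4?")
prompt = f"{context}\n\nQuestion: {question}"
# The prompt would then be sent to the LLM under evaluation and the model's
# numeric answer compared against the gold value (30, in this toy case).
```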
