TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

Abstract

The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advances have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnA generation pipeline, which starts from a selected seed of problems crafted by Subject Matter Experts. An evaluation of a wide range of open-source LLMs reveals that the best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these problems. We have released the dataset and the evaluation code to ease reproducibility of the results and to support future research.
