

TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

June 12, 2025
作者: Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Merouane Debbah
cs.AI

Abstract

The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnA generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that the best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.
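Since TeleMath problems have numerical solutions, evaluation comes down to comparing a model's numeric answer against a reference value. Below is a minimal sketch of such a check; the `question`/`answer` field names and the tolerance are assumptions for illustration, not the benchmark's actual schema or grading rule.

```python
import math

# Hypothetical QnA pair in the style TeleMath describes (field names assumed).
qna = {
    "question": "A channel has bandwidth 1 MHz and SNR 15 dB. "
                "What is the Shannon capacity in Mbit/s?",
    "answer": 5.028,
}

def grade(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Mark a prediction correct if it matches the reference numerically
    within a relative tolerance, rather than by exact string match."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

# Worked answer for the sample question: C = B * log2(1 + SNR_linear).
snr_linear = 10 ** (15 / 10)                  # 15 dB -> ~31.62 (linear)
predicted = 1.0 * math.log2(1 + snr_linear)   # capacity in Mbit/s
print(grade(predicted, qna["answer"]))        # True: within 1% of 5.028
```

A tolerance-based comparison like this avoids penalizing a model for rounding differences while still rejecting answers that are numerically wrong.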