WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
September 27, 2025
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
cs.AI
Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail
catastrophically on specialized technical mathematics. In wireless
communications, where problems require precise manipulation of
information-theoretic bounds, optimization constraints, and signal processing
formulations, even state-of-the-art models struggle to achieve competent
performance. We present WirelessMathLM, demonstrating that compact models
(0.5B-7B parameters) can match or exceed much larger models through
domain-specific reinforcement learning with verifiable rewards. Our key insight
is that wireless mathematics problems possess a unique property--verifiable
correctness--that enables effective reinforcement learning without human
feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027
problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with
binary verification rewards, we train models directly from base checkpoints
without supervised warm-start. Our 7B model achieves 39.5% accuracy on
WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times
fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training
nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B
+81%), with positive transfer to general mathematics benchmarks--our models
gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and
AIME without any training on these tasks.
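The abstract's training recipe, GRPO with binary verification rewards, centers on computing group-relative advantages: several answers are sampled per problem, each is verified as correct or not, and the binary rewards are normalized within the group. The sketch below illustrates only that normalization step under stated assumptions (the group size, rewards, and function name are illustrative, not the paper's exact implementation):

```python
# Minimal sketch of GRPO-style group-relative advantages with binary
# verification rewards. Assumption: G responses are sampled per problem
# and a verifier assigns each a reward of 1 (correct) or 0 (incorrect).

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: A_i = (r_i - mean) / std."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    if std == 0:
        # All responses equally right or wrong: no relative signal,
        # so the group contributes zero advantage.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# Example: 8 sampled answers to one problem, 3 verified correct.
rewards = [1, 0, 0, 1, 0, 0, 1, 0]
advs = group_relative_advantages(rewards)
```

Because advantages are centered within each group, verified-correct answers receive positive weight and incorrect ones negative weight, which is what lets training proceed without human feedback or a learned reward model.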