WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
September 27, 2025
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
cs.AI
Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail
catastrophically on specialized technical mathematics. In wireless
communications, where problems require precise manipulation of
information-theoretic bounds, optimization constraints, and signal processing
formulations, even state-of-the-art models struggle to achieve competent
performance. We present WirelessMathLM, demonstrating that compact models
(0.5B-7B parameters) can match or exceed much larger models through
domain-specific reinforcement learning with verifiable rewards. Our key insight
is that wireless mathematics problems possess a unique property--verifiable
correctness--that enables effective reinforcement learning without human
feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027
problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with
binary verification rewards, we train models directly from base checkpoints
without supervised warm-start. Our 7B model achieves 39.5% accuracy on
WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times
fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training
nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B
+81%), with positive transfer to general mathematics benchmarks--our models
gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and
AIME without any training on these tasks.
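The abstract's training recipe, GRPO with binary verification rewards, centers on computing group-relative advantages: several answers are sampled per problem, each is verified as correct or not, and the binary rewards are normalized within the group. The sketch below illustrates only that normalization step under stated assumptions (the group size, rewards, and function name are illustrative, not the paper's exact implementation):

```python
# Minimal sketch of GRPO-style group-relative advantages with binary
# verification rewards. Assumption: G responses are sampled per problem
# and a verifier assigns each a reward of 1 (correct) or 0 (incorrect).

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: A_i = (r_i - mean) / std."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    if std == 0:
        # All responses equally right or wrong: no relative signal,
        # so the group contributes zero advantage.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# Example: 8 sampled answers to one problem, 3 verified correct.
rewards = [1, 0, 0, 1, 0, 0, 1, 0]
advs = group_relative_advantages(rewards)
```

Because advantages are centered within each group, verified-correct answers receive positive weight and incorrect ones negative weight, which is what lets training proceed without human feedback or a learned reward model.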