WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
September 27, 2025
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
cs.AI
Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
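The training signal described in the abstract is simple enough to sketch: each sampled solution is scored by a binary verifier, and GRPO converts those scores into group-relative advantages without a learned critic. Below is a minimal Python sketch of that reward-and-advantage step under stated assumptions; it is not the paper's implementation. The `binary_reward` verifier, the example answers, and all names are illustrative placeholders (a real verifier would check symbolic equivalence of the final expression rather than exact string matches).

```python
import numpy as np

def binary_reward(completion: str, reference: str) -> float:
    """Illustrative verifier: 1.0 if the model's final answer matches the
    reference exactly, else 0.0. A practical verifier would parse the final
    expression and test symbolic equivalence (e.g., with SymPy)."""
    return 1.0 if completion.strip() == reference.strip() else 0.0

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each sample's reward by the mean
    and standard deviation of its group (all completions for one prompt),
    so no separate value/critic network is required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four sampled completions for one wireless-math problem,
# scored against a reference answer and converted to advantages.
reference = "R = log2(1 + SNR)"
samples = [
    "R = log2(1 + SNR)",
    "R = log(1 + SNR)",
    "R = log2(1 + SNR)",
    "R = SNR / N0",
]
rewards = np.array([binary_reward(s, reference) for s in samples])
advantages = grpo_advantages(rewards)
print(rewards)     # [1. 0. 1. 0.]
print(advantages)  # positive for verified-correct samples, negative otherwise
```

In a full GRPO update, these per-sample advantages would weight the token-level policy-gradient (clipped importance-ratio) loss for each completion; the sketch only shows how binary verification yields a usable learning signal within a group.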