WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
September 27, 2025
Authors: Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
cs.AI
Abstract
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
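The training signal described in the abstract is simple enough to sketch: each sampled solution is scored by a binary verifier, and GRPO converts those scores into group-relative advantages without a learned critic. Below is a minimal Python sketch of that reward-and-advantage step under stated assumptions; it is not the paper's implementation. The `binary_reward` verifier, the example answers, and all names are illustrative placeholders (a real verifier would check symbolic equivalence of the final expression rather than exact string matches).

```python
import numpy as np

def binary_reward(completion: str, reference: str) -> float:
    """Illustrative verifier: 1.0 if the model's final answer matches the
    reference exactly, else 0.0. A practical verifier would parse the final
    expression and test symbolic equivalence (e.g., with SymPy)."""
    return 1.0 if completion.strip() == reference.strip() else 0.0

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each sample's reward by the mean
    and standard deviation of its group (all completions for one prompt),
    so no separate value/critic network is required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: four sampled completions for one wireless-math problem,
# scored against a reference answer and converted to advantages.
reference = "R = log2(1 + SNR)"
samples = [
    "R = log2(1 + SNR)",
    "R = log(1 + SNR)",
    "R = log2(1 + SNR)",
    "R = SNR / N0",
]
rewards = np.array([binary_reward(s, reference) for s in samples])
advantages = grpo_advantages(rewards)
print(rewards)     # [1. 0. 1. 0.]
print(advantages)  # positive for verified-correct samples, negative otherwise
```

In a full GRPO update, these per-sample advantages would weight the token-level policy-gradient (clipped importance-ratio) loss for each completion; the sketch only shows how binary verification yields a usable learning signal within a group.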