WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning

Abstract

Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM and show that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems have a distinctive property, verifiable correctness, that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems drawn from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without a supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). GRPO training improves performance at every model scale, roughly doubling it at 3B and 7B (0.5B +11%, 3B +103%, 7B +81%). It also transfers positively to general mathematics benchmarks: our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
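To make the training signal concrete, the following is a minimal sketch of how a binary verification reward can be turned into group-relative advantages in the GRPO style: each sampled answer is scored 0 or 1, then normalized against the mean and standard deviation of its own group. The verifier here (`binary_reward`, a whitespace-insensitive string match) and the sample answers are hypothetical stand-ins and not the paper's verifier; using the population standard deviation and `eps = 1e-6` is likewise an assumption.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 on an exact match after removing whitespace, else 0.0."""
    def normalize(s: str) -> str:
        return "".join(s.split())
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four sampled answers to one prompt (made-up data).
answers = ["log2(1 + SNR)", "log2(1+SNR)", "log2(SNR)", "2*log2(1+SNR)"]
reference = "log2(1+SNR)"

rewards = [binary_reward(a, reference) for a in answers]
advantages = group_relative_advantages(rewards)

for answer, reward, advantage in zip(answers, rewards, advantages):
    print(f"{answer!r}: reward={reward}, advantage={advantage:+.3f}")
```

In this sketch, a group in which every answer receives the same reward yields all-zero advantages, so such a group contributes no learning signal; only groups with a mix of verified and unverified answers push the policy in either direction.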
