MathBode: Frequency-Domain Fingerprints of LLM Mathematical Reasoning

Abstract

This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of measuring one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit the first-harmonic responses of the model outputs and of the exact solutions. This yields interpretable, frequency-resolved metrics, gain (amplitude tracking) and phase (lag), which together form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument (G ≈ 1, φ ≈ 0). The results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
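The gain and phase described in the abstract can be illustrated with a small first-harmonic fit. The following is a minimal sketch under stated assumptions, not the released MathBode code: the function names `first_harmonic` and `gain_and_phase`, the least-squares formulation, the sign convention (negative phase means lag), and the toy problem y = 3p + 1 with a synthetic attenuated and delayed "model" response are all hypothetical choices made for illustration.

```python
import numpy as np

def first_harmonic(y, t, freq):
    """Least-squares fit of y(t) ~ c + a*sin(2*pi*f*t) + b*cos(2*pi*f*t).

    Returns (amplitude, phase) such that y ~ c + amplitude*sin(2*pi*f*t + phase).
    """
    w = 2.0 * np.pi * freq * np.asarray(t, dtype=float)
    design = np.column_stack([np.ones_like(w), np.sin(w), np.cos(w)])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    _, a, b = coeffs
    return np.hypot(a, b), np.arctan2(b, a)

def gain_and_phase(model_out, exact_out, t, freq):
    """Gain = amplitude ratio; phase = model phase minus exact phase (radians)."""
    amp_m, ph_m = first_harmonic(model_out, t, freq)
    amp_e, ph_e = first_harmonic(exact_out, t, freq)
    # Wrap the phase difference into (-pi, pi].
    dphi = np.angle(np.exp(1j * (ph_m - ph_e)))
    return amp_m / amp_e, dphi

# Toy setup (assumed): drive parameter p sinusoidally around 5.0.
t = np.arange(64)
freq = 4 / 64  # four full cycles over the sweep
p = 5.0 + 0.5 * np.sin(2 * np.pi * freq * t)
exact = 3.0 * p + 1.0  # exact solution of the hypothetical problem

# Synthetic stand-in for model answers: 80% amplitude, 0.3 rad lag.
lagged = 3.0 * (5.0 + 0.8 * 0.5 * np.sin(2 * np.pi * freq * t - 0.3)) + 1.0

g, dphi = gain_and_phase(lagged, exact, t, freq)
print(f"gain={g:.3f}, phase={dphi:.3f} rad")
```

Passing the exact solution as both arguments plays the role of the symbolic baseline: it should recover G ≈ 1 and φ ≈ 0, which is a quick sanity check on the fit itself. Repeating the measurement over several values of `freq` would give the points of a Bode-style curve.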
