
MathBode: Frequency-Domain Fingerprints of LLM Mathematical Reasoning

September 27, 2025
Author: Charles L. Wang
cs.AI

Abstract

This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics, gain (amplitude tracking) and phase (lag), that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2×2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument (G ≈ 1, φ ≈ 0). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
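
To make the gain and phase measurement concrete, here is a minimal sketch of a first-harmonic fit in plain Python. It is not the authors' released code. The function names (`first_harmonic`, `gain_and_phase`), the toy problem family y = 3p + 1, and the simulated "model" response are all hypothetical, chosen only for illustration. The sketch assumes the sweep covers a whole number of cycles with evenly spaced samples, and it adopts the convention that negative phase means the model lags the exact solution.

```python
import cmath
import math


def first_harmonic(values, cycles):
    """Complex first-harmonic coefficient of a series spanning `cycles` full periods.

    Assumes evenly spaced samples over a whole number of cycles, so the
    projection onto exp(-i * theta) isolates the driving frequency.
    """
    n = len(values)
    mean = sum(values) / n
    acc = 0j
    for t, v in enumerate(values):
        acc += (v - mean) * cmath.exp(-2j * math.pi * cycles * t / n)
    return 2 * acc / n


def gain_and_phase(model_outputs, exact_outputs, cycles):
    """Return (gain, phase in degrees) of the model relative to the exact solution."""
    ratio = first_harmonic(model_outputs, cycles) / first_harmonic(exact_outputs, cycles)
    return abs(ratio), math.degrees(cmath.phase(ratio))


# Toy sweep: drive one parameter p sinusoidally (hypothetical settings).
n, cycles = 64, 4
theta = [2 * math.pi * cycles * t / n for t in range(n)]
p = [5.0 + 2.0 * math.sin(th) for th in theta]

# Hypothetical closed-form family y = 3p + 1, evaluated exactly.
exact = [3.0 * x + 1.0 for x in p]

# Simulated model answers (illustration only): 80% amplitude, 20 degrees of lag.
lag = math.radians(20.0)
model = [16.0 + 0.8 * 6.0 * math.sin(th - lag) for th in theta]

gain, phase = gain_and_phase(model, exact, cycles)
baseline_gain, baseline_phase = gain_and_phase(exact, exact, cycles)
print(f"model:    G = {gain:.3f}, phi = {phase:.1f} deg")
print(f"baseline: G = {baseline_gain:.3f}, phi = {baseline_phase:.1f} deg")
```

Feeding the exact solution in as its own "model" plays the role of the symbolic baseline: it should return G ≈ 1 and φ ≈ 0, which checks the instrument before any LLM outputs are measured. Repeating the fit at several driving frequencies and plotting gain and phase against frequency would give a Bode-style curve.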