Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
April 20, 2026
Authors: Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak
cs.AI
Abstract
Multimodal LLMs can accurately perceive numerical content across modalities, yet they fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances produced by a reproducible generator. We also define arithmetic load, C, as the product of the total digit count and the non-zero digit count: a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. C remains predictive of performance across modalities and models, with R-squared often > 0.5, approaching the value obtained from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities even when multiplication accuracy collapses. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Decomposition is favored in both the text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating that the base model maintains a well-tuned internal router.