Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Abstract

Multimodal LLMs can accurately perceive numerical content across modalities, yet they fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances drawn from a reproducible generator. We also define arithmetic load, C, as the product of the total digit count and the non-zero digit count, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero once C exceeds 100. C remains predictive of performance across modalities and models, with R-squared often above 0.5, close to the value obtained from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Decomposition is favored in both the text and vision modalities. Heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating that the base model maintains a well-tuned internal router.
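To make the definition of arithmetic load concrete, the following is a minimal Python sketch of how C might be computed for a single multiplication problem. The abstract does not say how the digit counts are aggregated over the two operands; this sketch assumes both counts are pooled (summed) across the operands before being multiplied. The function names `digit_counts` and `arithmetic_load` are hypothetical and chosen only for illustration.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digits, non-zero digits) of an integer, ignoring its sign."""
    digits = str(abs(n))
    return len(digits), sum(ch != "0" for ch in digits)


def arithmetic_load(a: int, b: int) -> int:
    """Compute C = (total digit count) * (non-zero digit count).

    Assumption of this sketch: both counts are summed over the two
    operands. The abstract does not specify the aggregation.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Dense short operands, sparse longer operands, dense long operands.
    for a, b in [(12, 34), (4000, 3000), (48213, 97654)]:
        print(f"{a} x {b}: C = {arithmetic_load(a, b)}")
```

Under this pooling assumption, sparse operands such as 4000 x 3000 receive the same load as the much shorter dense pair 12 x 34, which reflects the role of digit sparsity as a separate factor in the benchmark design. A pair of dense five-digit operands reaches C = 100, the region where the abstract reports accuracy often approaching zero.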
