Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Abstract

Multimodal LLMs can accurately perceive numerical content across modalities, yet they fail at exact multi-digit multiplication when the same underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances drawn from a reproducible generator. We also define arithmetic load, C, as the product of the total digit count and the non-zero digit count, a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often approaching zero once C > 100. C remains predictive of performance across modalities and models, with R-squared often > 0.5, close to the value obtained from more complex measures of arithmetic load that count intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities. Heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating that the base model maintains a well-tuned internal router.
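
The arithmetic load C can be illustrated with a small calculation. The sketch below is a hedged reading of the definition, not the benchmark's own code: the abstract does not say how digit counts are aggregated over the two operands, so this version assumes both counts are pooled (summed) across operands before multiplying. The function names `digit_counts` and `arithmetic_load` are hypothetical.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digits, non-zero digits) for the decimal form of n."""
    digits = str(abs(n))
    return len(digits), sum(d != "0" for d in digits)


def arithmetic_load(a: int, b: int) -> int:
    """Arithmetic load C = (total digit count) * (non-zero digit count).

    Assumption (not stated in the abstract): both counts are summed
    over the two operands of the multiplication a * b.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Dense and sparse operands of similar length give very different loads.
    for a, b in [(12, 34), (4000, 7000), (123456, 654321)]:
        print(f"{a} x {b}: C = {arithmetic_load(a, b)}")
```

Under this assumption, a sparse problem such as 4000 x 7000 carries the same load as 12 x 34 despite having twice as many digits, while a dense six-digit by six-digit problem lands above the C > 100 region where the abstract reports accuracy often approaching zero.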
