Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
March 23, 2025
Authors: Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, Abdul Sattar
cs.AI
Abstract
Large Language Models (LLMs) have significantly advanced various fields,
particularly coding, mathematical reasoning, and logical problem solving.
However, a critical question remains: Do these mathematical reasoning abilities
persist when LLMs are presented with culturally adapted math problems?
Specifically, how do LLMs perform when faced with math problems embedded in
cultural contexts that have no significant representation in mainstream
web-scale AI training data? To explore this, we generated six synthetic
cultural datasets from GSM8K, a widely used benchmark for assessing LLMs'
mathematical reasoning skills. While preserving the mathematical logic and
numerical values of the original GSM8K test set, we modified cultural elements
such as personal names, food items, and place names. These culturally adapted
datasets provide a more reliable framework for evaluating LLMs' mathematical
reasoning under shifting cultural contexts. Our findings reveal that LLMs
struggle with math problems when cultural references change, even though the
underlying mathematical structure remains constant. Smaller models exhibit
greater performance drops compared to larger models. Interestingly, our results
also suggest that cultural familiarity can enhance mathematical reasoning. Even
models with no explicit mathematical training but exposure to relevant cultural
contexts sometimes outperform larger, mathematically proficient models on
culturally embedded math problems. This study highlights the impact of cultural
context on the mathematical reasoning abilities of LLMs, underscoring the need
for more diverse and representative training data to improve robustness in
real-world applications. The benchmark datasets and the script for reproducing
the results are available at
https://github.com/akarim23131/Lost_in_Cultural_Translation
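The dataset construction described in the abstract — swapping cultural surface elements while leaving the mathematical logic and numerical values untouched — can be sketched roughly as follows. The specific names and the substitution mapping here are illustrative assumptions, not the paper's actual substitution tables.

```python
import re

# Hypothetical mapping from cultural elements in a GSM8K-style problem to
# culturally adapted equivalents (names, items, places are assumptions
# chosen for illustration only).
CULTURAL_MAP = {
    "Natalia": "Amina",
    "clips": "bangles",
    "Paris": "Lahore",
}

def adapt_problem(text: str, mapping: dict) -> str:
    """Swap cultural elements while leaving all numerals untouched."""
    for original, replacement in mapping.items():
        # Word-boundary match so substrings inside other words are not replaced.
        text = re.sub(rf"\b{re.escape(original)}\b", replacement, text)
    return text

problem = "Natalia sold 48 clips in Paris in April."
adapted = adapt_problem(problem, CULTURAL_MAP)

# The mathematical content (the numbers) must be identical before and after.
assert re.findall(r"\d+", problem) == re.findall(r"\d+", adapted)
print(adapted)  # Amina sold 48 bangles in Lahore in April.
```

In this sketch the check on extracted numerals enforces the paper's stated invariant: only the cultural framing changes, never the arithmetic.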