Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Abstract

Large Language Models (LLMs) have significantly advanced various fields, particularly coding, mathematical reasoning, and logical problem solving. However, a critical question remains: do these mathematical reasoning abilities persist when LLMs are presented with culturally adapted math problems? Specifically, how do LLMs perform on math problems embedded in cultural contexts that have no significant representation in mainstream web-scale AI training data? To explore this, we generated six synthetic cultural datasets from GSM8K, a widely used benchmark for assessing LLMs' mathematical reasoning skills. While preserving the mathematical logic and numerical values of the original GSM8K test set, we modified cultural elements such as personal names, food items, and place names. These culturally adapted datasets provide a more reliable framework for evaluating LLMs' mathematical reasoning under shifting cultural contexts. Our findings reveal that LLMs struggle with math problems when cultural references change, even though the underlying mathematical structure remains constant. Smaller models exhibit greater performance drops than larger models. Interestingly, our results also suggest that cultural familiarity can enhance mathematical reasoning: models with no explicit mathematical training but with exposure to relevant cultural contexts sometimes outperform larger, mathematically proficient models on culturally embedded math problems. This study highlights the impact of cultural context on the mathematical reasoning abilities of LLMs, underscoring the need for more diverse and representative training data to improve robustness in real-world applications. The benchmark datasets and the script for reproducing the results are available at https://github.com/akarim23131/Lost_in_Cultural_Translation
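
The adaptation described above, swapping cultural entities while keeping the numbers fixed, can be illustrated with a minimal sketch. This is not the authors' pipeline: the substitution table, the example sentence, and the helper names `adapt` and `numbers` are all assumptions made for illustration.

```python
import re

# Hypothetical substitution table; these entities are illustrative only
# and are not taken from the released datasets.
SWAPS = {
    "Natalia": "Ayesha",
    "cupcakes": "samosas",
    "Boston": "Lahore",
}


def adapt(question: str, swaps: dict) -> str:
    """Replace whole-word cultural entities, leaving all other text untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], question)


def numbers(text: str) -> list:
    """Extract numeric tokens so the original and adapted text can be compared."""
    return re.findall(r"\d+(?:\.\d+)?", text)


# Hypothetical GSM8K-style sentence, written for this example.
original = "Natalia sold 48 cupcakes in Boston in April, and half as many in May."
adapted = adapt(original, SWAPS)

# The adaptation must not alter any numerical value.
assert numbers(adapted) == numbers(original)
print(adapted)
```

Checking that the numeric tokens are identical before and after substitution is one simple way to guard the "same mathematical structure" property; a fuller check would also compare the final answers.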

