Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Abstract

Large Language Models (LLMs) have significantly advanced various fields, particularly coding, mathematical reasoning, and logical problem solving. However, a critical question remains: do these mathematical reasoning abilities persist when LLMs are presented with culturally adapted math problems? Specifically, how do LLMs perform on math problems embedded in cultural contexts that have no significant representation in mainstream web-scale AI training data? To explore this, we generated six synthetic cultural datasets from GSM8K, a widely used benchmark for assessing LLMs' mathematical reasoning skills. While preserving the mathematical logic and numerical values of the original GSM8K test set, we modified cultural elements such as personal names, food items, and place names. These culturally adapted datasets provide a more reliable framework for evaluating LLMs' mathematical reasoning under shifting cultural contexts. Our findings reveal that LLMs struggle with math problems when cultural references change, even though the underlying mathematical structure remains constant. Smaller models exhibit greater performance drops than larger models. Interestingly, our results also suggest that cultural familiarity can enhance mathematical reasoning: models with no explicit mathematical training but with exposure to the relevant cultural contexts sometimes outperform larger, mathematically proficient models on culturally embedded math problems. This study highlights the impact of cultural context on the mathematical reasoning abilities of LLMs and underscores the need for more diverse and representative training data to improve robustness in real-world applications. The benchmark datasets and the script for reproducing the results are available at https://github.com/akarim23131/Lost_in_Cultural_Translation
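The adaptation step summarized in the abstract (swap cultural entities, leave the logic and numbers alone) can be illustrated with a minimal sketch. This is not the authors' pipeline: the mapping `ENTITY_MAP`, the helper names `adapt_problem` and `numbers_preserved`, and the sample sentence are all hypothetical and are not taken from GSM8K or the released datasets. The sketch only shows the invariant the abstract states, namely that numeric values survive the substitution unchanged.

```python
import re

# Hypothetical entity mapping, for illustration only (not from the released datasets).
ENTITY_MAP = {
    "Janet": "Amina",
    "muffins": "samosas",
    "farmers' market": "village bazaar",
}

# Matches integers and simple decimals such as 16 or 2.5.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def adapt_problem(text: str, entity_map: dict[str, str]) -> str:
    """Replace cultural entities in one pass, leaving all other text untouched."""
    # Longest keys first so multi-word entities are matched before their substrings.
    keys = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile("|".join(rf"\b{re.escape(k)}\b" for k in keys))
    return pattern.sub(lambda m: entity_map[m.group(0)], text)


def numbers_preserved(original: str, adapted: str) -> bool:
    """Check that the same numeric values appear in the same order."""
    return NUMBER_RE.findall(original) == NUMBER_RE.findall(adapted)


# Invented sample problem text (an assumption, not a GSM8K item).
original = "Janet bakes 16 muffins and sells 9 at the farmers' market for $2 each."
adapted = adapt_problem(original, ENTITY_MAP)

# The adapted problem must keep every number from the original.
assert numbers_preserved(original, adapted)
print(adapted)
```

A check like `numbers_preserved` is one simple way to guard against an adaptation accidentally changing the arithmetic; whether the released script performs such a check is not stated in the abstract.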
