Crosslingual Reasoning through Test-Time Scaling
May 8, 2025
Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
cs.AI
Abstract
Reasoning capabilities of large language models are primarily studied for
English, even when pretrained models are multilingual. In this work, we
investigate to what extent English reasoning finetuning with long
chain-of-thoughts (CoTs) can generalize across languages. First, we find that
scaling up inference compute for English-centric reasoning language models
(RLMs) improves multilingual mathematical reasoning across many languages
including low-resource languages, to an extent where they outperform models
twice their size. Second, we reveal that while English-centric RLMs' CoTs are
naturally predominantly English, they consistently follow a quote-and-think
pattern to reason about quoted non-English inputs. Third, we discover an
effective strategy to control the language of long CoT reasoning, and we
observe that models reason better and more efficiently in high-resource
languages. Finally, we observe poor out-of-domain reasoning generalization, in
particular from STEM to cultural commonsense knowledge, even for English.
Overall, we demonstrate the potential, study the mechanisms, and outline the
limitations of crosslingual generalization of English reasoning test-time
scaling. We conclude that practitioners should let English-centric RLMs reason
in high-resource languages, while further work is needed to improve reasoning
in low-resource languages and out-of-domain contexts.
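The abstract does not spell out how inference compute is scaled at test time. The sketch below illustrates one common test-time scaling technique ("budget forcing": appending a continuation cue such as "Wait" so the model keeps reasoning until a minimum chain-of-thought budget is spent); the model name, prompt, and token budget are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Minimal sketch of budget-forcing-style test-time scaling (an assumption,
# not the paper's confirmed method), using the Hugging Face transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed English-centric reasoning language model (RLM); swap in any open RLM.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def solve_with_budget(question: str, min_thinking_tokens: int = 2048) -> str:
    """Scale up inference compute by forcing the model to keep reasoning
    until a minimum number of chain-of-thought tokens has been generated."""
    messages = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    generated = 0
    while generated < min_thinking_tokens:
        output = model.generate(
            input_ids,
            max_new_tokens=min_thinking_tokens - generated,
            do_sample=False,
        )
        new_tokens = output[0, input_ids.shape[1]:]
        generated += new_tokens.shape[0]
        input_ids = output

        if generated < min_thinking_tokens:
            # The model stopped early: drop the end-of-sequence token and append
            # a continuation cue so it spends more of its thinking budget.
            if input_ids[0, -1].item() == tokenizer.eos_token_id:
                input_ids = input_ids[:, :-1]
            cue = tokenizer(
                "\nWait,", add_special_tokens=False, return_tensors="pt"
            ).input_ids.to(model.device)
            input_ids = torch.cat([input_ids, cue], dim=-1)

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Example: a non-English (French) math question, probing crosslingual generalization.
print(solve_with_budget("Quel est le plus petit nombre premier supérieur à 100 ?"))
```

A larger `min_thinking_tokens` budget corresponds to more test-time compute; the abstract's claim is that such scaling improves multilingual mathematical reasoning even though the finetuning data is English-only.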