
Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective

October 2, 2025
Authors: Wen Yang, Junhong Wu, Chong Li, Chengqing Zong, Jiajun Zhang
cs.AI

Abstract

Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: does the reasoning capability achieved from English RPT effectively transfer to other languages? We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial models, target languages, and training paradigms. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel-training study. Experimental results yield three key findings: the First-Parallel Leap, a substantial leap in performance when transitioning from monolingual training to just a single parallel language; a predictable Parallel Scaling Law, revealing that cross-lingual reasoning transfer follows a power law with the number of parallel training languages; and the Monolingual Generalization Gap, the discrepancy between actual monolingual performance and the power-law prediction, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
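
To make the Parallel Scaling Law and the Monolingual Generalization Gap concrete, below is a minimal sketch of how such a relationship could be fit and the gap measured. The abstract does not specify the exact functional form or any numbers, so the saturating power law `acc(n) = a - b * n^(-c)` and all data points here are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Saturating power law: accuracy approaches `a` as n grows (assumed form)."""
    return a - b * n ** (-c)

# Hypothetical benchmark accuracies for models trained with n parallel languages.
n_parallel = np.array([2, 3, 4, 5, 6], dtype=float)
acc_parallel = np.array([0.55, 0.60, 0.63, 0.65, 0.66])
acc_monolingual = 0.34  # hypothetical English-only (n = 1) result

# Fit the scaling curve on the parallel-training points only.
(a, b, c), _ = curve_fit(power_law, n_parallel, acc_parallel, p0=[0.7, 0.3, 1.0])
print(f"fit: acc(n) = {a:.3f} - {b:.3f} * n^(-{c:.3f})")

# Monolingual Generalization Gap: the power-law prediction at n = 1
# minus the accuracy actually observed for the monolingual model.
predicted_mono = power_law(1.0, a, b, c)
gap = predicted_mono - acc_monolingual
print(f"predicted acc at n=1: {predicted_mono:.3f}, "
      f"observed: {acc_monolingual:.3f}, gap: {gap:.3f}")
```

With these toy numbers, the monolingual model falls below the curve extrapolated from the parallel-training runs, which is the kind of shortfall the paper names the Monolingual Generalization Gap.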