When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Abstract

Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
