When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Abstract

Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
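As a rough illustration of the two ideas above (a prompt-based intervention and a check on whether the thinking trace stays in the user's language), the following is a minimal sketch. It is not the prompt or the metric used in this work: the instruction wording and the helper names `build_prompt` and `script_share` are hypothetical, and the script-based check is only a crude proxy that cannot separate languages sharing a script (for example French and English).

```python
import unicodedata


def build_prompt(question: str, language: str) -> str:
    """Append a hypothetical instruction asking the model to reason in `language`."""
    return (
        f"{question}\n\n"
        f"Please think step by step in {language} and give the final answer in {language}."
    )


def script_share(trace: str, script_prefix: str) -> float:
    """Return the fraction of alphabetic characters in `trace` whose Unicode
    name starts with `script_prefix` (e.g. "HANGUL", "LATIN", "CYRILLIC").

    Assumption: script membership is used as a stand-in for language identity.
    """
    letters = [ch for ch in trace if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


if __name__ == "__main__":
    prompt = build_prompt("12 곱하기 7은 얼마입니까?", "Korean")
    print(prompt)
    # A trace that drifts into English scores low on the Hangul share.
    print(script_share("먼저 12에 7을 곱합니다. So the answer is 84.", "HANGUL"))
```

A low share for the requested script suggests the trace has reverted to another language. A faithful evaluation would likely need a proper language identification model rather than this heuristic.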