Grammar-Based Reasoning: Can Synthetic Linguistic Reasoning Traces Improve Low-Resource Machine Translation?

Abstract

Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply grammatical information effectively during translation. Inspired by recent progress in chain-of-thought reasoning, we investigate whether low-resource MT can benefit from structured intermediate steps of linguistic analysis and grammatical reasoning. We propose a pipeline that automatically generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks. Using Xibe and Chintang as test cases, we evaluate these traces in three settings: in-context learning (ICL), supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT). Our results show that linguistic reasoning traces are most effective as inference-time guidance: in ICL, reliable sentence-specific traces substantially improve translation performance across most models, languages, and metrics. In contrast, using the traces as training data yields smaller and less consistent gains, because models learn the trace format but often generate erroneous content. These findings suggest that LLMs can leverage grammatical information for low-resource MT when given reliable linguistic analyses, while learning to generate such analyses remains a major bottleneck.
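To make the idea of generating a reasoning trace from a treebank, a dictionary, and a grammar-rule bank more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`parse_conllu`, `build_trace`), the trace wording, the lexicon and rule-bank formats, and the two-word sample sentence are all hypothetical. The sample forms are invented placeholders, not real Xibe or Chintang data. The only external convention relied on is the standard 10-column CoNLL-U layout used by Universal Dependencies.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """The CoNLL-U fields this sketch uses for one token."""
    idx: int
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int
    deprel: str


def parse_conllu(block):
    """Parse one CoNLL-U sentence (10 tab-separated columns per token)."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  feats, int(cols[6]), cols[7])
        )
    return tokens


def build_trace(tokens, lexicon, rules):
    """Turn annotations into numbered, human-readable reasoning steps.

    lexicon: lemma -> English gloss (hypothetical format)
    rules:   (feature, value) -> grammar note (hypothetical format)
    """
    forms = {t.idx: t.form for t in tokens}
    steps = []
    for t in tokens:
        gloss = lexicon.get(t.lemma, "<not in dictionary>")
        steps.append(f"'{t.form}': {t.upos}, lemma '{t.lemma}', gloss '{gloss}'.")
        for feat, value in sorted(t.feats.items()):
            note = rules.get((feat, value))
            if note:
                steps.append(f"'{t.form}' has {feat}={value}: {note}.")
        head = "the sentence root" if t.head == 0 else f"'{forms[t.head]}'"
        steps.append(f"'{t.form}' attaches to {head} as {t.deprel}.")
    return [f"Step {i}: {s}" for i, s in enumerate(steps, 1)]


# Invented placeholder sentence; NOT real Xibe or Chintang data.
SAMPLE = "\n".join([
    "# text = mira tolu",
    "1\tmira\tmira\tNOUN\t_\tCase=Nom\t2\tnsubj\t_\t_",
    "2\ttolu\ttol\tVERB\t_\tTense=Past\t0\troot\t_\t_",
])
LEXICON = {"mira": "child", "tol": "sleep"}
RULES = {
    ("Case", "Nom"): "nominative case marks the subject",
    ("Tense", "Past"): "past tense, so use an English past form",
}

if __name__ == "__main__":
    for step in build_trace(parse_conllu(SAMPLE), LEXICON, RULES):
        print(step)
```

In an ICL setting, a trace like this would be placed in the prompt alongside the source sentence. The sketch only covers trace construction and says nothing about how traces are used for SFT or RFT.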