Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Abstract

Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They use autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. Specifically, we develop a contrastive estimation approach that automatically identifies critical tokens by comparing the generation likelihoods of a positive model and a negative model. To achieve this, we separately fine-tune the positive and negative models on different reasoning trajectories, so that together they can identify the critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithm to token-level DPO and use the differential likelihood from the positive and negative models as an important weight for token-level DPO learning. Experimental results on the GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed cDPO approach.
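To make the contrastive idea concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes we already have per-token log-probabilities for one incorrect trajectory from a positive model and a negative model, and it scores each token by how much more likely the negative model finds it. The function names (`contrastive_scores`, `most_critical_token`, `weighted_token_dpo_loss`), the simple log-probability difference, and the way weights enter the loss are all illustrative assumptions; the abstract states only that the differential likelihood is used as a weight for token-level DPO.

```python
import math


def contrastive_scores(pos_logprobs, neg_logprobs):
    """Per-token score: log p_neg(token) - log p_pos(token).

    A large score means the negative model prefers the token much more
    than the positive model does (assumed scoring rule, for illustration).
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("log-probability lists must have the same length")
    return [neg - pos for pos, neg in zip(pos_logprobs, neg_logprobs)]


def most_critical_token(tokens, pos_logprobs, neg_logprobs):
    """Return (index, token, score) of the highest-scoring token."""
    scores = contrastive_scores(pos_logprobs, neg_logprobs)
    if len(tokens) != len(scores):
        raise ValueError("tokens and log-probabilities must align")
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return idx, tokens[idx], scores[idx]


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def weighted_token_dpo_loss(chosen_ratios, rejected_ratios,
                            rejected_weights, beta=0.1):
    """DPO-style loss with per-token weights on the rejected trajectory.

    Each *_ratios entry is a per-token log-ratio
    log pi_policy(token) - log pi_ref(token). The weighting scheme is a
    hypothetical stand-in for the paper's token-level rewards.
    """
    if len(rejected_ratios) != len(rejected_weights):
        raise ValueError("one weight is required per rejected token")
    chosen = sum(chosen_ratios)
    rejected = sum(w * r for w, r in zip(rejected_weights, rejected_ratios))
    margin = beta * (chosen - rejected)
    return _softplus(-margin)  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # Made-up numbers for a five-token incorrect trajectory.
    tokens = ["The", "total", "owed", "is", "12"]
    pos = [-0.2, -0.5, -3.0, -0.3, -1.0]
    neg = [-0.2, -0.6, -0.4, -0.3, -0.9]
    print(most_critical_token(tokens, pos, neg))
```

In this toy input, the token that the negative model finds far more likely than the positive model receives the highest score, which is the intuition behind flagging it as a candidate critical token. A real pipeline would obtain the log-probabilities from the two fine-tuned models and would need a principled mapping from scores to weights.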
