Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Abstract

Large language models (LLMs) have shown remarkable performance on reasoning tasks. They use autoregressive token generation to construct reasoning trajectories, which enables the development of a coherent chain of thought. In this work, we investigate the effect of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. In particular, we find that LLMs tend to produce positive outcomes when they are forced to decode other tokens in place of critical tokens. Based on this observation, we propose a novel approach, cDPO, which aims to detect critical tokens automatically and to apply token-level rewards during the alignment process. Concretely, we develop a contrastive estimation approach that identifies critical tokens automatically by comparing the generation likelihoods of a positive and a negative model. To this end, we fine-tune the positive and negative models separately on different reasoning trajectories, so that they can identify the critical tokens within incorrect trajectories that contribute to erroneous outcomes. In addition, we extend the conventional DPO algorithm to the token level during the alignment process and use the likelihood difference between the positive and negative models as an importance weight for token-level learning. Experimental results on the GSM8K and MATH500 benchmarks with the two widely used models Llama-3 (8B and 70B) and deepseek-math (7B) demonstrate the effectiveness of the proposed cDPO approach.

English

Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. Specifically, we develop a contrastive estimation approach that identifies critical tokens automatically by comparing the generation likelihoods of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying the critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithm to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as an importance weight for token-level DPO learning. Experimental results on the GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach, cDPO.
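
The contrastive estimation idea can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so everything here is an assumption made for illustration: the score is taken to be the per-token difference between the log-likelihood under the positive model and under the negative model, the token with the lowest score is treated as the most likely critical token, and the function names, tokens, and log-probabilities are hypothetical rather than taken from the paper.

```python
def contrastive_scores(pos_logprobs, neg_logprobs):
    """Per-token score: log p_pos(token) - log p_neg(token).

    Assumption for this sketch: a strongly negative score means the
    negative model finds the token far more likely than the positive
    model does, which we treat as a sign of a critical token.
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("both models must score the same token sequence")
    return [p - n for p, n in zip(pos_logprobs, neg_logprobs)]


def most_critical_token(tokens, pos_logprobs, neg_logprobs):
    """Return (index, token, score) of the lowest-scoring token."""
    scores = contrastive_scores(pos_logprobs, neg_logprobs)
    idx = min(range(len(scores)), key=scores.__getitem__)
    return idx, tokens[idx], scores[idx]


# Hypothetical tokens from an incorrect reasoning trajectory, with made-up
# log-probabilities standing in for the two fine-tuned models' outputs.
tokens = ["She", "owed", "12", "dollars"]
pos_logprobs = [-0.2, -4.0, -0.5, -0.3]
neg_logprobs = [-0.3, -0.4, -0.6, -0.3]

idx, token, score = most_critical_token(tokens, pos_logprobs, neg_logprobs)
print(idx, token, round(score, 2))
```

In a token-level DPO setting, per-token scores of this kind could serve as weights on each token's contribution to the loss. How cDPO actually normalizes and applies them is not specified in the abstract.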

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
