ConFu: Contemplate the Future for Better Speculative Sampling

Abstract

Speculative decoding has emerged as a powerful approach to accelerating large language model (LLM) inference: a lightweight draft model proposes candidate tokens, which the target model then verifies. The effectiveness of this paradigm depends critically on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, so their predictions drift from the target model over successive steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE (Mixture of Experts) to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
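To make the draft-then-verify loop described in the abstract concrete, here is a minimal toy sketch of greedy speculative decoding. It is not ConFu or EAGLE: the "models" are plain Python functions over integer tokens, verification is exact greedy matching rather than probabilistic acceptance, and all names (`speculative_step`, `generate`, `toy_target`, `toy_draft`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> Tuple[List[int], int]:
    """Run one draft-then-verify round.

    Returns the tokens to append and how many of them came from the draft.
    """
    # 1) The draft proposes k tokens, conditioning only on its own running prefix.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target checks each proposal in order. A real system would score
    #    all k positions in one batched forward pass; here we call it per token.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted, len(accepted) - 1
        accepted.append(tok)
        ctx.append(tok)

    # All k draft tokens matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted, k


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], float]:
    """Generate max_new tokens; also return the fraction of draft tokens accepted."""
    out = list(prompt)
    accepted_draft = 0
    rounds = 0
    while len(out) - len(prompt) < max_new:
        tokens, n_acc = speculative_step(out, draft, target, k)
        out.extend(tokens)
        accepted_draft += n_acc
        rounds += 1
    return out[:len(prompt) + max_new], accepted_draft / (rounds * k)


def toy_target(ctx: List[int]) -> int:
    # Toy target: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy draft: agrees with the target except at every third prefix length.
    if len(ctx) % 3 == 0:
        return (ctx[-1] + 2) % 10
    return (ctx[-1] + 1) % 10
```

With exact-match verification, the output is identical to what the target would produce on its own; the draft only affects how many target calls are saved. The acceptance rate returned by `generate` is the toy analogue of the token acceptance rate the abstract reports, and a draft that drifts after a wrong token (as `toy_draft` does) lowers it.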
