RelayLLM: Efficient Reasoning via Collaborative Decoding

Abstract

The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational cost and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity: they offload entire queries to the LLM, which wastes significant computation when the SLM can handle most of the reasoning steps on its own. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM lets the SLM act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of a warm-up phase followed by Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks show that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
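The token-level relay described above can be illustrated with a minimal sketch. It is not the authors' implementation: the command string `<call>`, the fixed `relay_len` hand-over length, the choice to keep the command out of the shared context, and the toy stand-in models are all assumptions made for illustration. It only shows the control flow in which the small model drives decoding, hands over to the large model when it emits the command, and then resumes, while tracking the fraction of output tokens produced by the large model.

```python
from typing import Callable, List, Tuple

# Hypothetical names: the source only mentions "a special command".
CALL_TOKEN = "<call>"
EOS = "<eos>"

# A "model" here is any function mapping the context to the next token.
Model = Callable[[List[str]], str]


def relay_decode(
    slm: Model,
    llm: Model,
    prompt: List[str],
    max_tokens: int = 32,
    relay_len: int = 1,
) -> Tuple[List[str], float]:
    """Decode with the SLM as controller; return (tokens, LLM token ratio)."""
    if relay_len < 1:
        raise ValueError("relay_len must be at least 1")
    context = list(prompt)
    output: List[str] = []
    llm_tokens = 0
    while len(output) < max_tokens:
        token = slm(context)
        if token == CALL_TOKEN:
            # The command is treated as control flow only (an assumption):
            # it is not appended to the context or the output.
            for _ in range(relay_len):
                token = llm(context)
                context.append(token)
                output.append(token)
                llm_tokens += 1
                if token == EOS or len(output) >= max_tokens:
                    break
        else:
            context.append(token)
            output.append(token)
        if output and output[-1] == EOS:
            break
    ratio = llm_tokens / len(output) if output else 0.0
    return output, ratio


def toy_slm(context: List[str]) -> str:
    # Scripted stand-in: asks for help at position 3, then resumes.
    script = {0: "The", 1: "answer", 2: "is", 3: CALL_TOKEN, 4: ".", 5: EOS}
    return script[len(context)]


def toy_llm(context: List[str]) -> str:
    # Scripted stand-in for the large model's "critical" token.
    return "4"


tokens, ratio = relay_decode(toy_slm, toy_llm, prompt=[])
print(tokens, ratio)
```

In this toy run the large model contributes one of six output tokens. The 1.07% figure reported in the abstract is the same kind of ratio, measured over the actual benchmarks.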
