Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
May 25, 2025
Authors: Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, here we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR.
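For concreteness, the inference-time combination described in the abstract can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: it assumes Hugging Face transformers-style causal LMs that share a vocabulary with the backbone, and the checkpoint paths, the guidance weight `alpha`, and greedy decoding are illustrative choices, not values taken from the paper.

```python
# Minimal sketch: at each decoding step, add the logits of one or more
# reasoning modules to the logits of a frozen backbone before picking
# the next token. Model paths, `alpha`, and greedy decoding are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone_name = "meta-llama/Llama-3.2-3B-Instruct"            # frozen backbone (assumed)
reasoner_names = ["path/to/unir-math", "path/to/unir-mt"]      # trained modules (placeholders)

# Assumes backbone and reasoners use the same tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained(backbone_name)
backbone = AutoModelForCausalLM.from_pretrained(backbone_name).eval()
reasoners = [AutoModelForCausalLM.from_pretrained(n).eval() for n in reasoner_names]

@torch.no_grad()
def generate_with_unir(prompt: str, max_new_tokens: int = 128, alpha: float = 1.0) -> str:
    """Greedy decoding from backbone logits plus the summed reasoner logits."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        combined = backbone(ids).logits[:, -1, :]              # backbone next-token logits
        for reasoner in reasoners:                             # modular composition: sum logits
            combined = combined + alpha * reasoner(ids).logits[:, -1, :]
        next_id = combined.argmax(dim=-1, keepdim=True)        # greedy pick from combined logits
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Because the reasoner only contributes in logit space, the backbone stays frozen and can in principle be swapped for a larger model without retraining the module, which is the weak-to-strong setting the abstract refers to.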