Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Abstract

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a single, lightweight, composable, plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be applied jointly by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
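
The inference-time combination described in the abstract (adding a module's output logits to those of the frozen backbone, and summing several modules for composition) can be illustrated with a small sketch. This is a toy example under stated assumptions, not the authors' implementation: the names `combine_logits`, `softmax`, and `weights`, the four-token vocabulary, and all numbers are hypothetical. It also assumes that every module shares the backbone's vocabulary, which element-wise addition requires. The optional per-module weight is an illustrative extra; its default of 1.0 gives the plain addition the abstract describes.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(
    backbone_logits: Sequence[float],
    module_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Add the logits of one or more reasoning modules to the backbone logits.

    `weights` is a hypothetical per-module scale (an assumption of this
    sketch); leaving it as None weights every module by 1.0.
    """
    if weights is None:
        weights = [1.0] * len(module_logits)
    if len(weights) != len(module_logits):
        raise ValueError("need exactly one weight per module")
    combined = list(backbone_logits)
    for w, logits in zip(weights, module_logits):
        if len(logits) != len(combined):
            raise ValueError("all logit vectors must cover the same vocabulary")
        combined = [c + w * x for c, x in zip(combined, logits)]
    return combined


# Made-up next-token logits over a 4-token vocabulary.
backbone = [2.0, 1.0, 0.5, 0.0]             # frozen LLM
math_module = [-1.0, 2.0, 0.0, 0.0]         # hypothetical math module
translation_module = [0.0, 0.0, 1.5, -0.5]  # hypothetical translation module

# One module: plain logit addition.
single = combine_logits(backbone, [math_module])

# Two modules: composition by summing all logits.
composed = combine_logits(backbone, [math_module, translation_module])
probs = softmax(composed)
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
```

In this toy setup the backbone alone would pick token 0, while the combined logits favour token 1. The backbone's parameters are never touched, so the same module outputs could in principle be added to a different backbone with the same vocabulary.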
