Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

Abstract

Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted as a compact task vector and transferred between models. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: \(v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}\). We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out the knowledge shared with the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). These improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows that reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
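
The arithmetic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy parameter names `w` and `b`, and the scaling coefficient `alpha` are hypothetical, introduced only for illustration. Checkpoints are modeled as plain dictionaries mapping parameter names to values; the same expressions would apply to any value type that supports `+`, `-`, and scalar `*`, such as NumPy arrays or PyTorch tensors taken from a model's state dict.

```python
from typing import Any, Dict

# A checkpoint is modeled as {parameter_name: value}. Values only need to
# support +, -, and scalar *, so floats are used here as toy stand-ins.
Checkpoint = Dict[str, Any]


def extract_reasoning_vector(theta_grpo: Checkpoint, theta_sft: Checkpoint) -> Checkpoint:
    """Compute v_reason = theta_GRPO - theta_SFT, one parameter at a time."""
    if theta_grpo.keys() != theta_sft.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {name: theta_grpo[name] - theta_sft[name] for name in theta_sft}


def apply_reasoning_vector(theta_target: Checkpoint, v_reason: Checkpoint,
                           alpha: float = 1.0) -> Checkpoint:
    """Compute theta_target + alpha * v_reason.

    alpha is a hypothetical scaling knob (an assumption of this sketch):
    alpha=1.0 adds the vector, alpha=-1.0 subtracts it.
    """
    if theta_target.keys() != v_reason.keys():
        raise ValueError("target model is not compatible with the vector")
    return {name: theta_target[name] + alpha * v_reason[name] for name in theta_target}


# Toy values, chosen to be exactly representable as floats.
theta_sft = {"w": 1.0, "b": 0.5}
theta_grpo = {"w": 1.25, "b": 0.25}
theta_target = {"w": 2.0, "b": 1.0}  # stands in for a compatible instruction-tuned model

v_reason = extract_reasoning_vector(theta_grpo, theta_sft)
enhanced = apply_reasoning_vector(theta_target, v_reason)             # addition
ablated = apply_reasoning_vector(theta_target, v_reason, alpha=-1.0)  # subtraction
```

The key check mirrors the abstract's notion of a "compatible" model: here it is reduced to matching parameter names, whereas real checkpoints would also need matching architectures and tensor shapes.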

