Behavior Knowledge Merge in Reinforced Agentic Models
January 20, 2026
Authors: Xiangchi Yuan, Dachuan Shi, Chunhui Zhang, Zheyuan Liu, Shenglong Yao, Soroush Vosoughi, Wenke Lee
cs.AI
Abstract
Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT), and they are suboptimal at preserving the task-specific capabilities of RL-trained agentic models. The root cause is a task-vector mismatch between RL and SFT: on-policy RL induces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense and globally comparable task vectors. When standard global averaging is applied under this mismatch, the non-overlapping task vectors from RL, which encode critical task-specific behaviors, are attenuated, and parameter updates are diluted. To address this issue, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework explicitly designed for RL-trained agentic models. RAM disentangles shared and task-specific unique parameter updates, averaging the shared components while selectively preserving and rescaling the unique ones to counteract parameter-update dilution. Experiments across multiple agent domains and model architectures demonstrate that RAM not only surpasses existing merging baselines but also unlocks synergistic potential among agents, achieving performance superior to that of the specialized agents in their own domains.
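A minimal sketch of the shared/unique merging step described above, assuming task vectors are the per-parameter deltas between each RL-trained agent and a common base model. The function name, the sparsity threshold eps, and the rescaling factor alpha are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch

def merge_task_vectors(base, task_vectors, eps=1e-8, alpha=1.0):
    """Merge sparse task vectors from several RL-trained agents.

    base:         dict[str, Tensor] of base-model parameters
    task_vectors: list of dict[str, Tensor], one (fine-tuned - base) delta per agent
    """
    merged = {}
    for name, base_param in base.items():
        deltas = torch.stack([tv[name] for tv in task_vectors])  # (num_agents, ...)
        active = deltas.abs() > eps                 # entries an agent actually updated
        count = active.sum(dim=0)                   # how many agents touch each entry

        shared_mask = count > 1                     # overlapping (shared) updates
        unique_mask = count == 1                    # task-specific (unique) updates

        # Average shared updates over the agents that actually modify them,
        # instead of over all agents, so sparse updates are not diluted.
        shared = deltas.sum(dim=0) / count.clamp(min=1)

        # Preserve unique updates and rescale them to counteract dilution.
        unique = deltas.sum(dim=0) * alpha

        delta = torch.where(shared_mask, shared, torch.zeros_like(base_param))
        delta = torch.where(unique_mask, unique, delta)
        merged[name] = base_param + delta
    return merged
```

The key design choice this sketch illustrates is that averaging is restricted to entries where task vectors overlap, while non-overlapping entries bypass averaging entirely and are rescaled, which is how dilution of sparse RL updates is avoided.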