Behavior Knowledge Merge in Reinforced Agentic Models

Abstract

Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT), and they are suboptimal for preserving task-specific capabilities in RL-trained agentic models. The root cause is a task-vector mismatch between RL and SFT: on-policy RL induces task vectors that are highly sparse and heterogeneous, whereas SFT-style merging implicitly assumes dense and globally comparable task vectors. When standard global averaging is applied under this mismatch, RL's non-overlapping task vectors, which encode critical task-specific behaviors, are shrunk and their parameter updates are diluted. To address this issue, we propose Reinforced Agent Merging (RAM), a distribution-aware merging framework explicitly designed for RL-trained agentic models. RAM disentangles shared parameter updates from task-specific unique ones, averaging the shared components while selectively preserving and rescaling the unique ones to counteract parameter update dilution. Experiments across multiple agent domains and model architectures demonstrate that RAM not only surpasses merging baselines but also unlocks synergistic potential among agents, achieving performance superior to that of the specialized agents in their own domains.
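
The dilution effect and the shared/unique split described above can be illustrated with a small NumPy sketch. This is not the RAM implementation: the abstract does not say how shared and unique updates are detected or how unique ones are rescaled, so the magnitude threshold `eps`, the constant factor `unique_scale`, and both function names are assumptions made purely for illustration.

```python
import numpy as np


def global_average(base, finetuned):
    """Standard task-vector averaging: every parameter is divided by the number of agents."""
    return base + np.mean([ft - base for ft in finetuned], axis=0)


def merge_shared_and_unique(base, finetuned, eps=1e-6, unique_scale=1.0):
    """Toy merge (hypothetical): average overlapping updates, keep unique ones undiluted.

    eps          -- assumed threshold for treating a parameter as "updated"
    unique_scale -- assumed constant rescaling factor for unique updates
    """
    # Task vectors: what each agent changed relative to the base model.
    deltas = np.stack([ft - base for ft in finetuned])   # shape (n_agents, n_params)
    active = np.abs(deltas) > eps                        # where each agent actually moved
    count = active.sum(axis=0)                           # agents touching each parameter
    summed = np.where(active, deltas, 0.0).sum(axis=0)   # sum over active updates only

    merged_delta = np.zeros_like(base)
    shared = count >= 2                                  # updated by several agents
    unique = count == 1                                  # updated by exactly one agent
    merged_delta[shared] = summed[shared] / count[shared]
    merged_delta[unique] = unique_scale * summed[unique]
    return base + merged_delta


base = np.zeros(4)
agent_a = np.array([1.0, 0.5, 0.0, 0.0])   # unique update at index 0
agent_b = np.array([0.0, 1.5, 2.0, 0.0])   # unique update at index 2; index 1 is shared

diluted = global_average(base, [agent_a, agent_b])
merged = merge_shared_and_unique(base, [agent_a, agent_b])
print(diluted)  # unique updates at indices 0 and 2 are halved
print(merged)   # unique updates are kept at full magnitude
```

In this toy setting, global averaging halves the two non-overlapping updates, while the split merge averages only the overlapping coordinate and leaves the unique ones intact.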
