Mixture-of-Experts Meets In-Context Reinforcement Learning

Abstract

In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), a framework that introduces the architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR replaces the feedforward layer with two parallel layers: a token-wise MoE that captures the distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts, managing a broad task distribution while alleviating gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of the two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly improves in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement that moves ICRL one step closer to the achievements of the language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
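To make the layer structure described above concrete, the following is a minimal sketch of two parallel MoE branches whose outputs are concatenated. It is an illustration under stated assumptions, not the implementation in the T2MIR repository: the names `TopKMoE` and `parallel_moe_layer` are hypothetical, experts are single linear maps rather than feedforward networks, the task-wise router is assumed to see a mean-pooled summary of the sequence, and each branch is assumed to output half the model width so that concatenation restores it. The contrastive objective for the task router is not shown.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class TopKMoE:
    """Sparse MoE: a linear router selects the top-k experts for an input."""

    def __init__(self, rng, d_in, d_out, n_experts, k):
        self.k = k
        self.router = rand_matrix(rng, n_experts, d_in)
        # Assumption: each expert is one linear map, kept small for brevity.
        self.experts = [rand_matrix(rng, d_out, d_in) for _ in range(n_experts)]

    def route(self, x):
        """Return indices of the top-k experts and their normalized weights."""
        logits = matvec(self.router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.k]
        return top, softmax([logits[i] for i in top])

    def apply(self, x, top, weights):
        """Weighted sum of the selected experts' outputs for input x."""
        out = [0.0] * len(self.experts[0])
        for i, w in zip(top, weights):
            y = matvec(self.experts[i], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


def parallel_moe_layer(tokens, token_moe, task_moe):
    """Run token-wise and task-wise MoE in parallel and concatenate outputs."""
    d = len(tokens[0])
    # Assumption: the task router sees a mean-pooled summary of the sequence,
    # so every token in the sequence shares one task-level routing decision.
    pooled = [sum(t[j] for t in tokens) / len(tokens) for j in range(d)]
    task_top, task_w = task_moe.route(pooled)
    outputs = []
    for x in tokens:
        tok_top, tok_w = token_moe.route(x)  # routing decided per token
        h_token = token_moe.apply(x, tok_top, tok_w)
        h_task = task_moe.apply(x, task_top, task_w)
        outputs.append(h_token + h_task)  # list concatenation
    return outputs, task_top


# Toy usage: 6 tokens (e.g. interleaved state, action, reward embeddings).
rng = random.Random(0)
d_model = 8
token_moe = TopKMoE(rng, d_model, d_model // 2, n_experts=4, k=2)
task_moe = TopKMoE(rng, d_model, d_model // 2, n_experts=4, k=1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(6)]
out, task_experts = parallel_moe_layer(tokens, token_moe, task_moe)
print(len(out), len(out[0]), task_experts)
```

In this sketch the only difference between the two branches is the granularity of routing: the token-wise branch may send a state token and a reward token to different experts, while the task-wise branch commits the whole sequence to the same expert set.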
