Mixture-of-Experts Meets In-Context Reinforcement Learning
June 5, 2025
Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
cs.AI
Abstract
In-context reinforcement learning (ICRL) has emerged as a promising paradigm
for adapting RL agents to downstream tasks through prompt conditioning.
However, two notable challenges remain in fully harnessing in-context learning
within RL domains: the intrinsic multi-modality of the state-action-reward data
and the diverse, heterogeneous nature of decision tasks. To tackle these
challenges, we propose T2MIR (Token- and Task-wise
MoE for In-context RL), an innovative framework that
introduces architectural advances of mixture-of-experts (MoE) into
transformer-based decision models. T2MIR replaces the feedforward layer with
two parallel layers: a token-wise MoE that captures distinct semantics of input
tokens across multiple modalities, and a task-wise MoE that routes diverse
tasks to specialized experts, managing a broad task distribution while
alleviating gradient conflicts. To enhance task-wise routing, we introduce a
contrastive learning method that maximizes the mutual information between the
task and its router representation, enabling more precise capture of
task-relevant information. The outputs of the two MoE components are
concatenated and fed into the next layer. Comprehensive experiments show that
T2MIR significantly enhances in-context learning capability and outperforms various
types of baselines. We bring the potential and promise of MoE to ICRL, offering
a simple and scalable architectural enhancement that moves ICRL one step
closer to the achievements of the language and vision communities. Our code is available
at https://github.com/NJU-RL/T2MIR.
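
Since the abstract only sketches the architecture, the following minimal PyTorch sketch illustrates the idea: the transformer's feedforward sublayer is replaced by a token-wise MoE (routing each state/action/reward token independently) and a task-wise MoE (routing a whole sequence by a pooled task summary) running in parallel, with the two outputs concatenated and fused. The expert count, top-k routing, mean-pooled task summary, and fusion projection are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
import torch
import torch.nn as nn


def make_expert(d_model: int) -> nn.Module:
    # Each expert is a standard position-wise feedforward network.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
    )


class TokenMoE(nn.Module):
    """Routes every token (state/action/reward embedding) independently,
    letting experts specialize in the distinct semantics of each modality."""

    def __init__(self, d_model: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(make_expert(d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, S, d)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # (B, S, k)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


class TaskMoE(nn.Module):
    """Routes a whole sequence by a pooled task summary, so that every token
    of one task is processed by the same soft mixture of experts."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(make_expert(d_model) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor):  # x: (B, S, d)
        task_repr = x.mean(dim=1)                      # (B, d) pooled task summary
        gate = self.router(task_repr).softmax(dim=-1)  # (B, E): one gate per task
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, S, d)
        out = torch.einsum("be,besd->bsd", gate, expert_out)
        return out, task_repr  # task_repr serves as the router input in this sketch


class T2MIRLayer(nn.Module):
    """Drop-in replacement for the feedforward sublayer: two parallel MoEs,
    outputs concatenated and fused back to the model dimension."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.token_moe = TokenMoE(d_model, n_experts)
        self.task_moe = TaskMoE(d_model, n_experts)
        self.fuse = nn.Linear(2 * d_model, d_model)  # map concat back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        token_out = self.token_moe(x)
        task_out, _ = self.task_moe(x)
        return self.fuse(torch.cat([token_out, task_out], dim=-1))
```

In a full decision transformer, this layer would sit after the self-attention sublayer inside each block, wrapped by the usual residual connection and layer normalization.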
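The contrastive routing objective can likewise be sketched as an InfoNCE-style lower bound on the mutual information between a task and its router representation. The pairing scheme (row-aligned task embeddings and pooled router representations from the same task), the temperature, and the symmetrized loss are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def routing_info_nce(task_emb: torch.Tensor, router_repr: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """task_emb, router_repr: (B, d); row i of both comes from task i."""
    z_t = F.normalize(task_emb, dim=-1)
    z_r = F.normalize(router_repr, dim=-1)
    logits = z_t @ z_r.t() / temperature  # (B, B) cosine similarities
    labels = torch.arange(z_t.size(0), device=logits.device)
    # Diagonal entries are matching (task, router) pairs: positives;
    # all off-diagonal pairs act as negatives. Symmetrize both directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

Minimizing this loss pulls each task's router representation toward its own task embedding and away from other tasks', which is what sharpens the task-wise routing described in the abstract.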