Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents
January 26, 2026
Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang
cs.AI
Abstract
Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information the agent must process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, training on the simple grid-world domain Sokoban yields stronger generalization to SciWorld than training on the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can effectively improve cross-domain robustness. We propose a low-overhead, broadly applicable randomization technique: add small amounts of distracting, goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains not included in the mid-training data mix; and (b) enabling step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
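To make the proposed randomization concrete, the sketch below is a minimal, hypothetical Python illustration (not the authors' code) of adding goal-irrelevant distractor features to a text observation: the goal and transition dynamics are untouched, and only the amount of information the agent must filter from the state increases. The function name `add_distractors` and the distractor pool are our own assumptions for illustration.

```python
import random

# Hypothetical pool of goal-irrelevant facts; in practice these might be drawn
# from unrelated objects, rooms, or attributes of the environment.
DISTRACTOR_POOL = [
    "A dusty lamp sits in the corner.",
    "You hear a faint humming from the vents.",
    "There is a stack of old newspapers on the shelf.",
    "A small plant rests on the windowsill.",
]

def add_distractors(observation: str, k: int = 2, seed=None) -> str:
    """Return the observation enriched with k goal-irrelevant sentences.

    The task itself is unchanged; only the state text becomes richer,
    which is the environment property the paper links to generalization.
    """
    rng = random.Random(seed)
    distractors = rng.sample(DISTRACTOR_POOL, k=min(k, len(DISTRACTOR_POOL)))
    return observation + " " + " ".join(distractors)

# Example usage: enrich every observation returned by a text environment.
# obs, reward, done, info = env.step(action)
# obs = add_distractors(obs, k=2)
```

In a full setup, one would verify that the injected sentences never mention goal-relevant objects, so the added features are distracting but not misleading.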