Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents
January 26, 2026
Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang
cs.AI
Abstract
Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and which modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information the agent must process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, the simple grid-world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can effectively improve cross-domain robustness. We propose a low-overhead, broadly applicable randomization technique: add small amounts of distracting, goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid-training data mix; and (b) turning on step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
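
As a rough illustration of the state-randomization idea described above, the sketch below is a minimal Python example and an assumption on our part, not the paper's actual implementation: the distractor pool, the function name enrich_observation, and its parameters are all hypothetical. It simply appends a few goal-irrelevant textual facts to each observation of a text-based environment, enriching the state the agent must parse without changing the task or reward.

```python
import random

# Hypothetical pool of goal-irrelevant facts; the paper's actual distractor
# features are not specified here.
DISTRACTOR_FACTS = [
    "A red key lies in the corner.",
    "The clock on the wall reads 3:15.",
    "A stack of unread letters sits on the desk.",
    "You hear faint music from another room.",
]

def enrich_observation(obs: str, n_distractors: int = 2, seed: int | None = None) -> str:
    """Append a few goal-irrelevant distractor facts to a raw text observation.

    The environment dynamics and goal are untouched; only the textual state
    becomes richer, which is the intended effect of the randomization technique.
    """
    rng = random.Random(seed)
    extras = rng.sample(DISTRACTOR_FACTS, k=min(n_distractors, len(DISTRACTOR_FACTS)))
    return obs + "\n" + "\n".join(extras)

# Example: wrap an environment's observations before handing them to the agent.
raw_obs = "You are in the kitchen. The goal is to place the mug in the sink."
print(enrich_observation(raw_obs, n_distractors=2, seed=0))
```

In practice, such a wrapper would sit between the environment and the agent at RL training time, so richer states are seen during post-training while evaluation environments remain unchanged.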