ARIA: Training Language Agents with Intention-Driven Reward Aggregation
May 31, 2025
Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao
cs.AI
Abstract
Large language models (LLMs) have enabled agents to perform complex reasoning
and decision-making through free-form language interactions. However, in
open-ended language action environments (e.g., negotiation or question-asking
games), the action space can be formulated as a joint distribution over tokens,
resulting in an exponentially large action space. Sampling actions in such a
space can lead to extreme reward sparsity, which brings large reward variance,
hindering effective reinforcement learning (RL). To address this, we propose
ARIA, a method that Aggregates Rewards in Intention space to enable efficient
and effective language Agents training. ARIA aims to project natural language
actions from the high-dimensional joint token distribution space into a
low-dimensional intention space, where semantically similar actions are
clustered and assigned shared rewards. This intention-aware reward aggregation
reduces reward variance by densifying reward signals, fostering better policy
optimization. Extensive experiments demonstrate that ARIA not only
significantly reduces policy gradient variance, but also delivers substantial
performance gains of an average of 9.95% across four downstream tasks,
consistently outperforming offline and online RL baselines.
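To make the core idea concrete, below is a minimal sketch of intention-based reward aggregation as described in the abstract: embed free-form language actions, cluster them into a small number of "intentions," and replace each action's reward with its cluster mean to densify the signal before policy optimization. This is not the authors' implementation; the `embed_fn` hook (any sentence-embedding model) and the cluster count `n_intentions` are illustrative assumptions.

```python
# Sketch of intention-based reward aggregation (illustrative, not the paper's code).
# Assumption: `embed_fn` maps a list of action strings to an (N, d) embedding matrix.
from typing import Callable, Sequence
import numpy as np
from sklearn.cluster import KMeans


def aggregate_rewards_by_intention(
    actions: Sequence[str],
    rewards: Sequence[float],
    embed_fn: Callable[[Sequence[str]], np.ndarray],
    n_intentions: int = 32,   # assumed hyperparameter, not taken from the paper
    seed: int = 0,
) -> np.ndarray:
    """Cluster actions in a low-dimensional intention space and return rewards
    replaced by their cluster means, yielding a denser, lower-variance signal
    for downstream policy-gradient training."""
    rewards = np.asarray(rewards, dtype=np.float64)
    embeddings = embed_fn(actions)                       # (N, d) semantic embeddings
    labels = KMeans(
        n_clusters=n_intentions, random_state=seed, n_init=10
    ).fit_predict(embeddings)                            # assign each action an intention
    aggregated = np.empty_like(rewards)
    for c in range(n_intentions):
        mask = labels == c
        if mask.any():
            aggregated[mask] = rewards[mask].mean()      # shared reward per intention
    return aggregated
```

The aggregated rewards can then be plugged into a standard policy-gradient update in place of the raw sparse rewards; the clustering step is what trades per-action reward precision for reduced variance.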