ARIA: Training Language Agents with Intention-Driven Reward Aggregation
May 31, 2025
Authors: Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao
cs.AI
Abstract
Large language models (LLMs) have enabled agents to perform complex reasoning
and decision-making through free-form language interactions. However, in
open-ended language action environments (e.g., negotiation or question-asking
games), the action space can be formulated as a joint distribution over tokens,
resulting in an exponentially large action space. Sampling actions in such a
space can lead to extreme reward sparsity, which brings large reward variance,
hindering effective reinforcement learning (RL). To address this, we propose
ARIA, a method that Aggregates Rewards in Intention space to enable efficient
and effective language Agents training. ARIA aims to project natural language
actions from the high-dimensional joint token distribution space into a
low-dimensional intention space, where semantically similar actions are
clustered and assigned shared rewards. This intention-aware reward aggregation
reduces reward variance by densifying reward signals, fostering better policy
optimization. Extensive experiments demonstrate that ARIA not only
significantly reduces policy gradient variance, but also delivers substantial
performance gains of an average of 9.95% across four downstream tasks,
consistently outperforming offline and online RL baselines.
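To make the core idea concrete, below is a minimal sketch of intention-based reward aggregation as described in the abstract: embed free-form language actions, cluster them into a small number of "intentions," and replace each action's reward with its cluster mean to densify the signal before policy optimization. This is not the authors' implementation; the `embed_fn` hook (any sentence-embedding model) and the cluster count `n_intentions` are illustrative assumptions.

```python
# Sketch of intention-based reward aggregation (illustrative, not the paper's code).
# Assumption: `embed_fn` maps a list of action strings to an (N, d) embedding matrix.
from typing import Callable, Sequence
import numpy as np
from sklearn.cluster import KMeans


def aggregate_rewards_by_intention(
    actions: Sequence[str],
    rewards: Sequence[float],
    embed_fn: Callable[[Sequence[str]], np.ndarray],
    n_intentions: int = 32,   # assumed hyperparameter, not taken from the paper
    seed: int = 0,
) -> np.ndarray:
    """Cluster actions in a low-dimensional intention space and return rewards
    replaced by their cluster means, yielding a denser, lower-variance signal
    for downstream policy-gradient training."""
    rewards = np.asarray(rewards, dtype=np.float64)
    embeddings = embed_fn(actions)                       # (N, d) semantic embeddings
    labels = KMeans(
        n_clusters=n_intentions, random_state=seed, n_init=10
    ).fit_predict(embeddings)                            # assign each action an intention
    aggregated = np.empty_like(rewards)
    for c in range(n_intentions):
        mask = labels == c
        if mask.any():
            aggregated[mask] = rewards[mask].mean()      # shared reward per intention
    return aggregated
```

The aggregated rewards can then be plugged into a standard policy-gradient update in place of the raw sparse rewards; the clustering step is what trades per-action reward precision for reduced variance.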