Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents

May 4, 2025
作者: Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
cs.AI

Abstract

Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current approaches. Existing methods either lack this kind of reasoning capability or enforce uniform long chain-of-thought reasoning across all scenarios, resulting in excessive token usage and inappropriate social simulation. In this paper, we propose Adaptive Mode Learning (AML), which strategically selects from four thinking modes (intuitive reaction → deep contemplation) based on real-time context. Our framework's core innovation, the Adaptive Mode Policy Optimization (AMPO) algorithm, introduces three key advancements over existing methods: (1) multi-granular thinking mode design, (2) context-aware mode switching across social interactions, and (3) token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence tasks confirm that AML achieves 15.6% higher task performance than state-of-the-art methods. Notably, our method outperforms GRPO by 7.0% with 32.8% shorter reasoning chains. These results demonstrate that context-sensitive thinking mode selection, as implemented in AMPO, enables more human-like adaptive reasoning than GRPO's fixed-depth approach.
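
To make the idea of context-aware mode switching concrete, here is a minimal Python sketch. Only the two endpoint modes, intuitive reaction and deep contemplation, are named in the abstract; the two intermediate mode labels, the toy scoring policy, the function names, and the length-penalized reward are all illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of AML-style adaptive thinking-mode selection.
# The two intermediate mode names, the toy scoring policy, and the
# length-penalized reward are illustrative assumptions, not the paper's code.

THINKING_MODES = [
    "intuitive_reaction",    # cheapest: respond directly, no explicit reasoning
    "light_reflection",      # assumed intermediate mode
    "structured_reasoning",  # assumed intermediate mode
    "deep_contemplation",    # most expensive: full chain-of-thought
]

def mode_scores(context_complexity: float) -> dict[str, float]:
    """Score each mode for the current social context (toy linear policy)."""
    n = len(THINKING_MODES) - 1
    return {
        mode: 1.0 - abs(context_complexity - i / n)
        for i, mode in enumerate(THINKING_MODES)
    }

def select_mode(context_complexity: float) -> str:
    """Context-aware mode switching: pick the best-scoring mode."""
    scores = mode_scores(context_complexity)
    return max(scores, key=scores.get)

def shaped_reward(task_score: float, tokens_used: int, alpha: float = 1e-3) -> float:
    """Token-efficient objective: task reward minus a length penalty (assumed form)."""
    return task_score - alpha * tokens_used

if __name__ == "__main__":
    # Simple contexts map to cheap modes, complex ones to deeper reasoning.
    for complexity in (0.1, 0.5, 0.9):
        print(f"context complexity {complexity:.1f} -> {select_mode(complexity)}")
```

In the actual AMPO algorithm, the mode choice is learned jointly with the response policy via reinforcement learning rather than hand-scored as here; the sketch only illustrates why matching reasoning depth to context can cut token usage on easy interactions while preserving it for hard ones.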
