Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents
May 4, 2025
Authors: Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
cs.AI
Abstract
Effective social intelligence simulation requires language agents to
dynamically adjust reasoning depth, a capability notably absent in current
approaches. Existing methods either lack this kind of reasoning
capability or enforce uniform long chain-of-thought reasoning across all
scenarios, resulting in excessive token usage and inappropriate social
simulation. In this paper, we propose Adaptive Mode
Learning (AML), which strategically selects from four
thinking modes (intuitive reaction → deep contemplation) based on
real-time context. Our framework's core innovation, the Adaptive
Mode Policy Optimization (AMPO)
algorithm, introduces three key advancements over existing methods: (1)
Multi-granular thinking mode design, (2) Context-aware mode switching across
social interactions, and (3) Token-efficient reasoning via depth-adaptive
processing. Extensive experiments on social intelligence tasks confirm that AML
achieves 15.6% higher task performance than state-of-the-art methods. Notably,
our method outperforms GRPO by 7.0% with 32.8% shorter reasoning chains. These
results demonstrate that context-sensitive thinking mode selection, as
implemented in AMPO, enables more human-like adaptive reasoning than GRPO's
fixed-depth approach.
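
To make the mode-selection idea concrete, below is a minimal Python sketch of how a policy might sample one rollout per thinking mode and score it with a GRPO-style, group-relative advantage that also penalizes token usage. This is an illustrative assumption, not the authors' released AMPO implementation: the mode names, token budgets, length_weight, and the functions rollout_group and group_relative_advantages are all hypothetical.

# Hypothetical sketch: context-aware thinking-mode rollouts scored with a
# GRPO-style, group-relative reward that discourages unnecessary depth.
# All names, budgets, and the length penalty are illustrative assumptions.
import statistics
from typing import Callable, Dict, List, Tuple

# Four modes spanning the paper's "intuitive reaction -> deep contemplation"
# spectrum; per-mode token budgets here are invented for illustration.
MODES: Dict[str, int] = {
    "intuitive_reaction": 64,
    "shallow_thinking": 256,
    "strategic_reasoning": 512,
    "deep_contemplation": 1024,
}

def rollout_group(policy: Callable[[str, str, int], Tuple[str, int]],
                  context: str) -> List[dict]:
    """Sample one rollout per mode so the group covers every reasoning depth."""
    rollouts = []
    for mode, budget in MODES.items():
        reply, tokens_used = policy(context, mode, budget)
        rollouts.append({"mode": mode, "reply": reply, "tokens": tokens_used})
    return rollouts

def group_relative_advantages(rollouts: List[dict],
                              task_reward: Callable[[str], float],
                              length_weight: float = 1e-3) -> List[float]:
    """GRPO-style advantages: subtract the group mean and divide by the group
    std, with a small token penalty so shallower modes win when they suffice."""
    rewards = [task_reward(r["reply"]) - length_weight * r["tokens"] for r in rollouts]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

In a full AMPO-like training loop, advantages of this kind would presumably weight a clipped policy-gradient update over both the chosen mode and the generated response, analogous to GRPO but with depth-adaptive token budgets.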