

Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs

May 31, 2025
作者: Yufa Zhou, Shaobo Wang, Xingyu Dong, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
cs.AI

Abstract

Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.
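
The RLVR technique named in the abstract hinges on a reward that can be checked programmatically rather than learned from preference data. Below is a minimal sketch of such a verifier in Python, assuming a boxed final-answer convention common in math-reasoning benchmarks; the function name, regex, and answer format are illustrative assumptions, not details taken from the paper.

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Binary reward for an RLVR-style pipeline: 1.0 if the model's
    final answer matches the programmatically checkable gold answer,
    else 0.0. The \\boxed{...} convention is an assumption borrowed
    from common math-reasoning benchmarks, not a detail confirmed
    by the abstract."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# Example: a game-theory answer whose equilibrium payoff is checkable.
output = r"Both players defect, so the equilibrium payoff is \boxed{(1, 1)}."
print(verifiable_reward(output, "(1, 1)"))  # -> 1.0
```

Because economic and game-theoretic problems often have exact, verifiable solutions (equilibria, optimal allocations, prices), this kind of binary check is what makes them a natural fit for RLVR-style post-training.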
