Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs
May 31, 2025
Authors: Yufa Zhou, Shaobo Wang, Xingyu Dong, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
cs.AI
Abstract
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS)
remains challenging due to intricate reward modeling, dynamic agent
interactions, and demanding generalization requirements. This paper explores
whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and
Reinforcement Learning with Verifiable Rewards (RLVR), can effectively
generalize to multi-agent scenarios. We use economic reasoning as a
testbed, leveraging its strong foundations in mathematics and game theory, its
demand for structured analytical reasoning, and its relevance to real-world
applications such as market design, resource allocation, and policy analysis.
We introduce Recon (Reasoning like an
ECONomist), a 7B-parameter open-source LLM post-trained on a
hand-curated dataset of 2,100 high-quality economic reasoning problems.
Comprehensive evaluation on economic reasoning benchmarks and multi-agent games
reveals clear improvements in structured reasoning and economic rationality.
These results underscore the promise of domain-aligned post-training for
enhancing reasoning and agent alignment, shedding light on the roles of SFT and
RL in shaping model behavior. Code is available at
https://github.com/MasterZhou1/Recon.
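
To make the RLVR component of the abstract concrete: the defining feature of Reinforcement Learning with Verifiable Rewards is that the reward comes from a programmatic check against a known-correct answer rather than a learned reward model. Below is a minimal sketch of such a rule-based reward function. The \boxed{} answer convention, the exact-match rule, and the example question are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a verifiable-reward signal in the style of RLVR.
# The answer format, extraction rule, and reward values are assumptions
# for illustration; the Recon paper's pipeline may differ.
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer, assuming it is wrapped in \\boxed{...}
    (a common convention for math-style reasoning outputs)."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the
    verified ground truth, else 0.0. Real pipelines typically normalize
    answers (e.g., numeric tolerance) before comparing."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


# Example: scoring a completion for a simple game-theory question.
completion = (
    "In a second-price auction, bidding one's true value is weakly dominant, "
    "so the optimal bid is \\boxed{10}."
)
print(verifiable_reward(completion, "10"))  # 1.0
```

A binary, automatically checkable reward of this kind is what makes a small hand-curated problem set (here, 2,100 economic reasoning problems with verified answers) usable as RL training data without a separate reward model.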