ChatPaper.ai


CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

October 5, 2025
Authors: Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang
cs.AI

Abstract

Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs; in particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, yet yield high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B-parameter LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path toward expert-level performance on challenging optimization modeling tasks.
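The intervention loop the abstract describes (flag a reasoning flaw, inject a concise hint, let the model regenerate from that point, and keep the corrected trajectory as training data) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: `model_generate`, `find_flaw`, and `make_hint` are hypothetical stand-ins for the LRM sampler and the expert intervener.

```python
# Hypothetical sketch of a CALM-style hint-intervention loop.
# An expert intervener locates the first flawed reasoning step, splices in a
# short corrective hint (the paper reports <2.6% of tokens changed), and the
# model continues generation from the corrected prefix. Accepted trajectories
# would then serve as supervised fine-tuning data before the RL stage.

def calm_collect(problem, model_generate, find_flaw, make_hint, max_rounds=3):
    """Collect one improved reasoning trajectory via lightweight hints."""
    trace = model_generate(problem, prefix="")
    for _ in range(max_rounds):
        flaw_pos = find_flaw(trace)            # char index of first flaw, or None
        if flaw_pos is None:
            break                              # trajectory accepted as SFT data
        hint = make_hint(trace, flaw_pos)      # concise corrective hint
        prefix = trace[:flaw_pos] + hint       # keep the sound part, add the hint
        trace = model_generate(problem, prefix=prefix)
    return trace
```

In this framing, the intervener edits only a small prefix of each trajectory, so the model's own reasoning style supplies the rest of the data, which is the property the abstract credits for preserving native reasoning patterns.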