CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
October 5, 2025
作者: Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated strong capabilities in
complex multi-step reasoning, opening new opportunities for automating
optimization modeling. However, existing domain adaptation methods, originally
designed for earlier instruction-tuned models, often fail to exploit the
advanced reasoning patterns of modern LRMs; in particular, we show that
direct fine-tuning on traditional non-reflective datasets yields only
limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose
CALM (Corrective Adaptation with Lightweight Modification), a
framework that progressively refines LRMs within their native reasoning modes
for optimization modeling tasks. In CALM, an expert intervener identifies
reasoning flaws and provides concise corrective hints, which the LRM
incorporates to produce improved reasoning trajectories. These interventions
modify fewer than 2.6% of generated tokens, yet yield high-quality data for
soft adaptation through supervised fine-tuning. The adapted model is then
further improved through reinforcement learning. Building on CALM, we develop
STORM (Smart Thinking Optimization Reasoning Model), a
4B-parameter LRM that achieves a new state-of-the-art average accuracy of
68.9% across five popular optimization modeling benchmarks, matching the
performance of a 671B-parameter LRM. These results demonstrate that dynamic,
hint-based data synthesis both preserves and amplifies the native reasoning
patterns of modern LRMs, offering a more effective and scalable path toward
expert-level performance on challenging optimization modeling tasks.
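The intervene-and-regenerate loop described in the abstract can be sketched in toy form. The following is a minimal illustration, not the paper's actual pipeline: the function names (`lrm_generate`, `intervener`, `calm_refine`), the token-level "flaw", and the hint mechanism are all hypothetical stand-ins for an LRM call, an expert check, and CALM's corrective hints.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    tokens: list
    hints: list = field(default_factory=list)

def lrm_generate(problem, hints):
    """Toy stand-in for an LRM: emits a reasoning trajectory for a tiny LP.
    Without hints it makes a sign error; a hint corrects that flaw."""
    tokens = ["let", "x>=0", "maximize", "3x", "s.t.", "x<=5", "=>", "x=5"]
    if "nonneg" not in hints:
        tokens[1] = "x<=0"  # injected flaw the intervener should catch
    return Trajectory(tokens, list(hints))

def intervener(traj):
    """Toy expert check: return a concise corrective hint, or None if sound."""
    if "x<=0" in traj.tokens:
        return "nonneg"
    return None

def calm_refine(problem, max_rounds=3):
    """Iterate: generate, let the expert flag a flaw, regenerate with the hint.
    Accepted trajectories would then feed supervised fine-tuning."""
    hints = []
    traj = lrm_generate(problem, hints)
    for _ in range(max_rounds):
        hint = intervener(traj)
        if hint is None:
            break
        hints.append(hint)
        traj = lrm_generate(problem, hints)
    return traj

traj = calm_refine("max 3x s.t. x<=5, x>=0")
assert intervener(traj) is None  # refined trajectory passes the expert check
# Fraction of tokens the intervention changed (the paper reports < 2.6%
# on real data; this toy example changes 1 of 8 tokens).
base = lrm_generate("max 3x s.t. x<=5, x>=0", []).tokens
changed = sum(a != b for a, b in zip(base, traj.tokens)) / len(traj.tokens)
print(changed)  # → 0.125
```

The key design point the sketch mirrors is that the intervener never rewrites the trajectory itself; it only supplies a short hint, and the model regenerates, which is why so few tokens end up modified.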