CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Abstract

Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. In particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.
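The hint-based correction loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `refine_with_hints`, `generate`, `intervene`, and `Trajectory` are hypothetical, the model and the expert intervener are replaced by toy stand-in functions, and counting whitespace-separated tokens to estimate the share of hint tokens is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """A reasoning trace, split into model-written and hint tokens."""
    model_tokens: List[str] = field(default_factory=list)
    hint_tokens: List[str] = field(default_factory=list)

    def hint_fraction(self) -> float:
        # Assumption: share of hint tokens among all tokens in the trace.
        total = len(self.model_tokens) + len(self.hint_tokens)
        return len(self.hint_tokens) / total if total else 0.0


def refine_with_hints(
    problem: str,
    generate: Callable[[str], str],
    intervene: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Trajectory:
    """Alternate model reasoning with short corrective hints.

    `generate` stands in for the reasoning model and `intervene` for the
    expert intervener; both are hypothetical callables.
    """
    traj = Trajectory()
    context = problem
    for _ in range(max_rounds):
        reasoning = generate(context)
        traj.model_tokens.extend(reasoning.split())
        hint = intervene(problem, reasoning)
        if hint is None:  # no flaw found: keep the trajectory as is
            break
        traj.hint_tokens.extend(hint.split())
        # The model continues from its own reasoning plus the hint.
        context = f"{context}\n{reasoning}\nHint: {hint}"
    return traj


# Toy stand-ins so the sketch runs end to end.
def toy_generate(context: str) -> str:
    if "Hint:" in context:
        return "maximize 3x + 2y subject to x + y <= 4 and x, y >= 0"
    return "maximize 3x + 2y subject to x + y <= 4"


def toy_intervene(problem: str, reasoning: str) -> Optional[str]:
    return None if ">= 0" in reasoning else "add nonnegativity constraints"


traj = refine_with_hints("toy LP", toy_generate, toy_intervene)
print(f"hint share of tokens: {traj.hint_fraction():.1%}")
```

In this toy run the hint share is far larger than the figure reported in the abstract, because the stand-in reasoning traces are only a few tokens long; the sketch shows the control flow only, not realistic proportions.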
