CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Abstract

Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. In particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, yet they generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B-parameter LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.
