Test-Time Scaling with Reflective Generative Model
July 2, 2025
Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
cs.AI
Abstract
We introduce our first reflective generative model, MetaStone-S1, which attains performance comparable to OpenAI o3 via a self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, cutting PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series at only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
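
To make the shared-backbone design concrete, here is a minimal PyTorch sketch of the SPRM idea: one transformer trunk feeds both a next-token head (the policy) and a lightweight per-step scoring head (the PRM), so the reward model adds almost no parameters on top of the policy. Module names, sizes, and the toy encoder trunk are illustrative assumptions, not the released MetaStone-S1 architecture.

```python
import torch
import torch.nn as nn


class SPRMSketch(nn.Module):
    """Minimal sketch of the SPRM idea (illustrative, not the released
    MetaStone-S1 code): a shared trunk feeds two task-specific heads,
    so the policy and the process reward model reuse nearly all parameters."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 512, layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab_size)  # policy: next-token logits
        self.score_head = nn.Linear(hidden, 1)        # PRM: per-step process score

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(self.embed(input_ids))  # (B, T, hidden)
        logits = self.lm_head(hidden)                  # (B, T, vocab)
        step_scores = torch.sigmoid(self.score_head(hidden)).squeeze(-1)  # (B, T)
        return logits, step_scores


# The scoring head is tiny relative to the trunk, which is where the
# "over 99% fewer PRM parameters" saving would come from.
model = SPRMSketch()
tokens = torch.randint(0, 32000, (1, 16))
logits, scores = model(tokens)
print(logits.shape, scores.shape)  # torch.Size([1, 16, 32000]) torch.Size([1, 16])
```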
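
Likewise, a hedged sketch of how SPRM scores could drive test-time scaling: sample several reasoning trajectories under a thinking-length budget, score each with the process head, and keep the best. The token budgets per effort mode, the `sample_trajectory` helper, the candidate count, and the mean aggregation of step scores are all assumptions for illustration; the paper's actual TTS procedure may differ.

```python
import torch

# Assumed token budgets per reasoning-effort mode (illustrative values only;
# the abstract says thinking length is controllable but gives no numbers).
EFFORT_BUDGET = {"low": 1024, "medium": 4096, "high": 16384}


def select_best_trajectory(model, prompt_ids, sample_trajectory,
                           effort: str = "medium", num_candidates: int = 8):
    """Best-of-N test-time scaling guided by SPRM step scores.

    `sample_trajectory(model, prompt_ids, max_new_tokens)` is an assumed
    decoding helper returning a full token sequence; averaging step scores
    is one simple choice of trajectory-level score.
    """
    max_new_tokens = EFFORT_BUDGET[effort]
    best_ids, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = sample_trajectory(model, prompt_ids, max_new_tokens)
        with torch.no_grad():
            _, step_scores = model(candidate)  # per-step process scores
        score = step_scores.mean().item()      # aggregate into one number
        if score > best_score:
            best_ids, best_score = candidate, score
    return best_ids
```

Raising the effort mode increases total thinking computation (longer trajectories and/or more candidates), which is the quantity the paper's empirical scaling law relates to TTS performance.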