Test-Time Scaling with Reflective Generative Model
July 2, 2025
Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
cs.AI
Abstract
We introduce our first reflective generative model, MetaStone-S1, which attains performance comparable to OpenAI o3 via a self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, cutting PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series at only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
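
To make the shared-backbone design concrete, here is a minimal PyTorch sketch of the SPRM idea: one transformer trunk feeds both a next-token head (the policy) and a lightweight per-step scoring head (the PRM), so the reward model adds almost no parameters on top of the policy. Module names, sizes, and the toy encoder trunk are illustrative assumptions, not the released MetaStone-S1 architecture.

```python
import torch
import torch.nn as nn


class SPRMSketch(nn.Module):
    """Minimal sketch of the SPRM idea (illustrative, not the released
    MetaStone-S1 code): a shared trunk feeds two task-specific heads,
    so the policy and the process reward model reuse nearly all parameters."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 512, layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.lm_head = nn.Linear(hidden, vocab_size)  # policy: next-token logits
        self.score_head = nn.Linear(hidden, 1)        # PRM: per-step process score

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(self.embed(input_ids))  # (B, T, hidden)
        logits = self.lm_head(hidden)                  # (B, T, vocab)
        step_scores = torch.sigmoid(self.score_head(hidden)).squeeze(-1)  # (B, T)
        return logits, step_scores


# The scoring head is tiny relative to the trunk, which is where the
# "over 99% fewer PRM parameters" saving would come from.
model = SPRMSketch()
tokens = torch.randint(0, 32000, (1, 16))
logits, scores = model(tokens)
print(logits.shape, scores.shape)  # torch.Size([1, 16, 32000]) torch.Size([1, 16])
```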
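
Likewise, a hedged sketch of how SPRM scores could drive test-time scaling: sample several reasoning trajectories under a thinking-length budget, score each with the process head, and keep the best. The token budgets per effort mode, the `sample_trajectory` helper, the candidate count, and the mean aggregation of step scores are all assumptions for illustration; the paper's actual TTS procedure may differ.

```python
import torch

# Assumed token budgets per reasoning-effort mode (illustrative values only;
# the abstract says thinking length is controllable but gives no numbers).
EFFORT_BUDGET = {"low": 1024, "medium": 4096, "high": 16384}


def select_best_trajectory(model, prompt_ids, sample_trajectory,
                           effort: str = "medium", num_candidates: int = 8):
    """Best-of-N test-time scaling guided by SPRM step scores.

    `sample_trajectory(model, prompt_ids, max_new_tokens)` is an assumed
    decoding helper returning a full token sequence; averaging step scores
    is one simple choice of trajectory-level score.
    """
    max_new_tokens = EFFORT_BUDGET[effort]
    best_ids, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = sample_trajectory(model, prompt_ids, max_new_tokens)
        with torch.no_grad():
            _, step_scores = model(candidate)  # per-step process scores
        score = step_scores.mean().item()      # aggregate into one number
        if score > best_score:
            best_ids, best_score = candidate, score
    return best_ids
```

Raising the effort mode increases total thinking computation (longer trajectories and/or more candidates), which is the quantity the paper's empirical scaling law relates to TTS performance.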