Test-Time Scaling with Reflective Generative Model
July 2, 2025
Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
cs.AI
Abstract
We introduce our first reflective generative model, MetaStone-S1, which
attains performance comparable to OpenAI o3 via a self-supervised process
reward model (SPRM). By sharing the backbone network and using task-specific
heads for next-token prediction and process scoring respectively, SPRM
integrates the policy model and the process reward model (PRM) into a unified
interface without extra process annotation, reducing PRM parameters by over
99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally
suited to test-time scaling (TTS), and we provide three reasoning effort modes
(low, medium, and high) based on controllable thinking length. Moreover, we
empirically establish a scaling law that reveals the relationship between
total thinking computation and TTS performance. Experiments demonstrate that
MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with
only 32B parameters. To support the research community, we have open-sourced
MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
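
The abstract describes SPRM's core design: one shared backbone with two
task-specific heads, and test-time scaling that ranks candidate reasoning
trajectories by their process scores. Below is a minimal PyTorch sketch of
that idea; ReflectiveLM, select_best, the layer sizes, and the mean-score
aggregation are all illustrative assumptions, not the released MetaStone-S1
implementation (causal masking and training are omitted for brevity).

```python
import torch
import torch.nn as nn

class ReflectiveLM(nn.Module):
    """Shared-backbone sketch: a policy head plus a process-scoring head."""

    def __init__(self, vocab_size: int, d_model: int = 1024, n_layers: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # NOTE: a real LM needs causal masking; omitted here for brevity.
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Policy head: standard next-token logits over the vocabulary.
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Process-scoring head: one scalar per position. A head this small is
        # how a shared-backbone PRM can add well under 1% extra parameters.
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, input_ids: torch.Tensor):
        h = self.backbone(self.embed(input_ids))        # (B, T, d_model)
        logits = self.lm_head(h)                        # next-token prediction
        step_scores = torch.sigmoid(self.score_head(h)).squeeze(-1)  # (B, T)
        return logits, step_scores


@torch.no_grad()
def select_best(model: ReflectiveLM,
                trajectories: list[torch.Tensor]) -> torch.Tensor:
    """Best-of-N test-time scaling: keep the highest-scored trajectory.

    Averaging per-step scores is an assumed aggregation; the paper's exact
    scoring rule may differ.
    """
    best, best_score = None, float("-inf")
    for ids in trajectories:                 # each ids: (T,) token tensor
        _, scores = model(ids.unsqueeze(0))  # add batch dim -> (1, T)
        s = scores.mean().item()
        if s > best_score:
            best, best_score = ids, s
    return best
```

In this framing, the three reasoning effort modes would correspond to how
much thinking computation is spent before selection, e.g. how many candidate
trajectories are sampled and how long each is allowed to run.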