Test-Time Scaling with Reflective Generative Model

Abstract

We introduce MetaStone-S1, our first reflective generative model, which reaches OpenAI o3-mini-level performance by means of a self-supervised process reward model (SPRM). By sharing the backbone network and using separate task-specific heads for next-token prediction and process scoring, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without additional process annotation, and cuts PRM parameters by more than 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning-effort modes (low, medium, and high) based on controllable thinking length. We also empirically establish a scaling law that relates total thinking computation to TTS performance. Experiments show that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
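The shared-backbone design described in the abstract can be illustrated with a toy sketch. This is not the MetaStone-S1 implementation: the class `ToyReflectiveModel`, the layer sizes, the mean-pooling backbone, the geometric-mean aggregation of step scores, and the `best_of_n` selection rule are all assumptions made for illustration. The abstract itself does not specify how step scores are aggregated or how the three effort modes are realized. The sketch only shows the structural idea: one backbone feeds both a next-token head and a small scoring head, and the scoring head ranks candidate reasoning trajectories at test time.

```python
import math
import random

HIDDEN, VOCAB = 8, 16  # toy sizes, chosen only for illustration


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyReflectiveModel:
    """One shared backbone feeding two task-specific heads (hypothetical layout)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = make_matrix(VOCAB, HIDDEN, rng)     # shared backbone: embeddings
        self.mix = make_matrix(HIDDEN, HIDDEN, rng)      # shared backbone: one layer
        self.lm_head = make_matrix(VOCAB, HIDDEN, rng)   # head 1: next-token logits
        self.score_head = make_matrix(1, HIDDEN, rng)    # head 2: process score

    def backbone(self, tokens):
        # Mean-pool the token embeddings, then apply one tanh layer.
        pooled = [
            sum(self.embed[t][i] for t in tokens) / len(tokens)
            for i in range(HIDDEN)
        ]
        return [math.tanh(h) for h in matvec(self.mix, pooled)]

    def next_token_logits(self, tokens):
        return matvec(self.lm_head, self.backbone(tokens))

    def step_score(self, tokens):
        # The sigmoid keeps each step score in (0, 1).
        (logit,) = matvec(self.score_head, self.backbone(tokens))
        return 1.0 / (1.0 + math.exp(-logit))

    def trajectory_score(self, steps):
        # Assumption: aggregate per-step scores with a geometric mean.
        scores = [self.step_score(step) for step in steps]
        return math.exp(sum(math.log(s) for s in scores) / len(scores))


def best_of_n(model, candidates):
    """Return the candidate trajectory with the highest aggregated score."""
    return max(candidates, key=model.trajectory_score)


if __name__ == "__main__":
    model = ToyReflectiveModel(seed=0)
    rng = random.Random(1)
    # Four candidate trajectories, each with three steps of four token ids.
    candidates = [
        [[rng.randrange(VOCAB) for _ in range(4)] for _ in range(3)]
        for _ in range(4)
    ]
    best = best_of_n(model, candidates)
    print("selected trajectory score:", model.trajectory_score(best))
    # The scoring head adds only HIDDEN weights on top of the shared backbone.
    print("extra scoring parameters:", len(model.score_head) * HIDDEN)
```

In this toy layout the scoring head contributes a single row of `HIDDEN` weights, while the policy and the scorer reuse the same backbone. That is the sense in which a shared-backbone reward model can be far smaller than a separately trained PRM, although the exact savings reported in the abstract depend on the real architecture.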
