VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
June 15, 2026
Authors: Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang
cs.AI
Abstract
This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building on the Spectrum-to-Signal post-training paradigm, we systematically improve the model through an optimized pipeline comprising curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations show that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it scores 94.3 on AIME26 (97.1 with claim-level test-time scaling), reaches 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization, with a 96.1% acceptance rate on recent unseen LeetCode contests. These results place it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, whereas open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes but a complementary path toward frontier-level performance in parameter-dense capability regimes.
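For readers unfamiliar with the Pass@1 metric cited above, the following is a minimal sketch of the widely used unbiased pass@k estimator. The abstract does not state how Pass@1 was computed for LiveCodeBench v6, so this estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical names chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number of correct samples,
    k: evaluation budget. Assumed estimator, not confirmed by the report.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems, which is how a benchmark-level Pass@1 figure is typically read.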