VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
June 15, 2026
Authors: Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang
cs.AI
Abstract
This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling) and an 80.2 Pass@1 on LiveCodeBench v6, and it exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.